* RDMA Read: Local protection error
From: Chuck Lever @ 2016-04-29 16:24 UTC (permalink / raw)
  To: linux-rdma

I've found some new behavior, recently, while testing the
v4.6-rc Linux NFS/RDMA client and server.

When certain kernel memory debugging CONFIG options are
enabled, 1MB NFS WRITEs can sometimes result in an
IB_WC_LOC_PROT_ERR. I usually turn on most of them because
I want to see any problems, so I'm not sure which option
in particular is exposing the issue.

When debugging is enabled on the server, and the underlying
device is using FRWR to register the sink buffer, an RDMA
Read occasionally completes with LOC_PROT_ERR.

When debugging is enabled on the client, and the underlying
device uses FRWR to register the target of an RDMA Read, an
ingress RDMA Read request sometimes gets a Syndrome 99
(REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
on the client completes with LOC_PROT_ERR.
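
(For context, here is roughly how the error surfaces when the
completion queue is polled. This is a minimal sketch with made-up
names, not the actual xprtrdma completion handler; it assumes the
usual kernel headers, e.g. <rdma/ib_verbs.h>.)

	static void log_prot_errors(struct ib_cq *cq)
	{
		struct ib_wc wc;

		/* Drain completions and flag local protection errors. */
		while (ib_poll_cq(cq, 1, &wc) > 0) {
			if (wc.status == IB_WC_LOC_PROT_ERR)
				pr_err("wr_id %llu: local protection error\n",
				       (unsigned long long)wc.wr_id);
		}
	}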

I do not see this problem when kernel memory debugging is
disabled, or when the client is using FMR, or when the
server is using physical addresses to post its RDMA Read WRs,
or when wsize is 512KB or smaller.

I have not found any obvious problems with the client logic
that registers NFS WRITE buffers, nor the server logic that
constructs and posts RDMA Read WRs.

My next step is to bisect. But first, I was wondering if
this behavior might be related to the recent problems with
s/g lists seen with iSER/SRP? ie, is this a recognized
issue?


--
Chuck Lever





* Re: RDMA Read: Local protection error
From: Santosh Shilimkar @ 2016-04-29 16:44 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma



On 4/29/2016 9:24 AM, Chuck Lever wrote:
> I've found some new behavior, recently, while testing the
> v4.6-rc Linux NFS/RDMA client and server.
>
> When certain kernel memory debugging CONFIG options are
> enabled, 1MB NFS WRITEs can sometimes result in a
> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
> I want to see any problems, so I'm not sure which option
> in particular is exposing the issue.
>
> When debugging is enabled on the server, and the underlying
> device is using FRWR to register the sink buffer, an RDMA
> Read occasionally completes with LOC_PROT_ERR.
>
> When debugging is enabled on the client, and the underlying
> device uses FRWR to register the target of an RDMA Read, an
> ingress RDMA Read request sometimes gets a Syndrome 99
> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
> on the client completes with LOC_PROT_ERR.
>
> I do not see this problem when kernel memory debugging is
> disabled, or when the client is using FMR, or when the
> server is using physical addresses to post its RDMA Read WRs,
> or when wsize is 512KB or smaller.
>
> I have not found any obvious problems with the client logic
> that registers NFS WRITE buffers, nor the server logic that
> constructs and posts RDMA Read WRs.
>
One possibility here could be a mismatch between the posted
send and receive WRs. Can you check whether, in certain cases,
you are posting Receive WRs that can't handle what the send
side is putting on the wire?

> My next step is to bisect. But first, I was wondering if
> this behavior might be related to the recent problems with
> s/g lists seen with iSER/SRP? ie, is this a recognized
> issue?
>
>
> --
> Chuck Lever
>
>
>


* Re: RDMA Read: Local protection error
From: Bart Van Assche @ 2016-04-29 16:45 UTC (permalink / raw)
  To: Chuck Lever, linux-rdma

On 04/29/2016 09:24 AM, Chuck Lever wrote:
> I've found some new behavior, recently, while testing the
> v4.6-rc Linux NFS/RDMA client and server.
>
> When certain kernel memory debugging CONFIG options are
> enabled, 1MB NFS WRITEs can sometimes result in a
> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
> I want to see any problems, so I'm not sure which option
> in particular is exposing the issue.
>
> When debugging is enabled on the server, and the underlying
> device is using FRWR to register the sink buffer, an RDMA
> Read occasionally completes with LOC_PROT_ERR.
>
> When debugging is enabled on the client, and the underlying
> device uses FRWR to register the target of an RDMA Read, an
> ingress RDMA Read request sometimes gets a Syndrome 99
> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
> on the client completes with LOC_PROT_ERR.
>
> I do not see this problem when kernel memory debugging is
> disabled, or when the client is using FMR, or when the
> server is using physical addresses to post its RDMA Read WRs,
> or when wsize is 512KB or smaller.
>
> I have not found any obvious problems with the client logic
> that registers NFS WRITE buffers, nor the server logic that
> constructs and posts RDMA Read WRs.
>
> My next step is to bisect. But first, I was wondering if
> this behavior might be related to the recent problems with
> s/g lists seen with iSER/SRP? ie, is this a recognized
> issue?

Hello Chuck,

A few days ago I observed similar behavior with the SRP protocol but 
only if I increase max_sect in /etc/srp_daemon.conf from the default to 
4096. My setup was as follows:
* Kernel 4.6.0-rc5 at the initiator side.
* A whole bunch of kernel debugging options enabled at the initiator
   side.
* The following settings in /etc/modprobe.d/ib_srp.conf:
   options ib_srp cmd_sg_entries=255 register_always=1
* The following settings in /etc/srp_daemon.conf:
   a queue_size=128,max_cmd_per_lun=128,max_sect=4096
* Kernel 3.0.101 at the target side.
* Kernel debugging disabled at the target side.
* mlx4 driver at both sides.

Decreasing max_sge at the target side from 32 to 16 did not help. I have 
not yet had the time to analyze this further.

Bart.


* Re: RDMA Read: Local protection error
From: Chuck Lever @ 2016-04-29 16:58 UTC (permalink / raw)
  To: Santosh Shilimkar; +Cc: linux-rdma


> On Apr 29, 2016, at 12:44 PM, Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> 
> 
> On 4/29/2016 9:24 AM, Chuck Lever wrote:
>> I've found some new behavior, recently, while testing the
>> v4.6-rc Linux NFS/RDMA client and server.
>> 
>> When certain kernel memory debugging CONFIG options are
>> enabled, 1MB NFS WRITEs can sometimes result in a
>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>> I want to see any problems, so I'm not sure which option
>> in particular is exposing the issue.
>> 
>> When debugging is enabled on the server, and the underlying
>> device is using FRWR to register the sink buffer, an RDMA
>> Read occasionally completes with LOC_PROT_ERR.
>> 
>> When debugging is enabled on the client, and the underlying
>> device uses FRWR to register the target of an RDMA Read, an
>> ingress RDMA Read request sometimes gets a Syndrome 99
>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>> on the client completes with LOC_PROT_ERR.
>> 
>> I do not see this problem when kernel memory debugging is
>> disabled, or when the client is using FMR, or when the
>> server is using physical addresses to post its RDMA Read WRs,
>> or when wsize is 512KB or smaller.
>> 
>> I have not found any obvious problems with the client logic
>> that registers NFS WRITE buffers, nor the server logic that
>> constructs and posts RDMA Read WRs.
>> 
> One possibility here could be the mismatch in posted WR for
> send/receive. Can you check if for certain cases you are
> posting receive WRs which can't handle whats send is putting
> on the wire.

I've confirmed that the client is posting only 1024-byte
Receive buffers, and that the ib_sge for each Receive
operation is the same before and after the Receive is
posted (ie, the Receive ib_sge is valid and is not
getting overwritten somehow).
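
Roughly, the check looks like this (a sketch with hypothetical names,
not the actual xprtrdma code): snapshot the ib_sge, post the Receive,
then compare.

	static int post_recv_checked(struct ib_qp *qp, struct ib_sge *sge)
	{
		struct ib_sge before = *sge;	/* snapshot before posting */
		struct ib_recv_wr wr = {
			.wr_id   = 0,		/* application context, elided */
			.sg_list = sge,
			.num_sge = 1,
		};
		struct ib_recv_wr *bad_wr;
		int rc = ib_post_recv(qp, &wr, &bad_wr);

		if (!rc && memcmp(&before, sge, sizeof(before)))
			pr_warn("Receive ib_sge changed after posting\n");
		return rc;
	}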

The wire traffic contains Send Only requests of 230 or so
bytes. If an ingress Send is too large, the Receive should
complete with IB_WC_LOC_LEN_ERR, not IB_WC_LOC_PROT_ERR.

The server disconnects due to the REM_OP_ERR. The
LOC_PROT_ERR completion appears to be the first Receive
completion after the QP is reconnected.

The client-side error completion on the Receive WR seems
to be a latent report of an earlier problem with an ingress
RDMA Read request.


>> My next step is to bisect. But first, I was wondering if
>> this behavior might be related to the recent problems with
>> s/g lists seen with iSER/SRP? ie, is this a recognized
>> issue?
>> 
>> 
>> --
>> Chuck Lever
>> 
>> 
>> 

--
Chuck Lever





* Re: RDMA Read: Local protection error
From: Chuck Lever @ 2016-04-29 17:02 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-rdma


> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>> I've found some new behavior, recently, while testing the
>> v4.6-rc Linux NFS/RDMA client and server.
>> 
>> When certain kernel memory debugging CONFIG options are
>> enabled, 1MB NFS WRITEs can sometimes result in a
>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>> I want to see any problems, so I'm not sure which option
>> in particular is exposing the issue.
>> 
>> When debugging is enabled on the server, and the underlying
>> device is using FRWR to register the sink buffer, an RDMA
>> Read occasionally completes with LOC_PROT_ERR.
>> 
>> When debugging is enabled on the client, and the underlying
>> device uses FRWR to register the target of an RDMA Read, an
>> ingress RDMA Read request sometimes gets a Syndrome 99
>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>> on the client completes with LOC_PROT_ERR.
>> 
>> I do not see this problem when kernel memory debugging is
>> disabled, or when the client is using FMR, or when the
>> server is using physical addresses to post its RDMA Read WRs,
>> or when wsize is 512KB or smaller.
>> 
>> I have not found any obvious problems with the client logic
>> that registers NFS WRITE buffers, nor the server logic that
>> constructs and posts RDMA Read WRs.
>> 
>> My next step is to bisect. But first, I was wondering if
>> this behavior might be related to the recent problems with
>> s/g lists seen with iSER/SRP? ie, is this a recognized
>> issue?
> 
> Hello Chuck,
> 
> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
> * Kernel 4.6.0-rc5 at the initiator side.
> * A whole bunch of kernel debugging options enabled at the initiator
>  side.
> * The following settings in /etc/modprobe.d/ib_srp.conf:
>  options ib_srp cmd_sg_entries=255 register_always=1
> * The following settings in /etc/srp_daemon.conf:
>  a queue_size=128,max_cmd_per_lun=128,max_sect=4096
> * Kernel 3.0.101 at the target side.
> * Kernel debugging disabled at the target side.
> * mlx4 driver at both sides.

Indeed, I should have mentioned that I'm using mlx4 as well,
on both endpoints.

--
Chuck Lever





* Re: RDMA Read: Local protection error
From: Laurence Oberman @ 2016-04-29 17:34 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Chuck Lever, linux-rdma

Just as an FYI

I went back and tested SRP on mlx4, because my customer mentioned that on the older arrays he was able to set max_sectors_kb=4096.
The claim was that on mlx4 they did not hit the sg map failure when issuing 4MB buffered I/O to a file system backed by the SRP targets.
This made no sense to me, because I know the issues we addressed in Bart's patch set were in ib_srp.

I tested without Bart's latest SRP patch set (which throttles back max_sectors_kb to avoid the issue) and
I see the same sg map failures on mlx4, using FDR rather than EDR.

So, bottom line: with a larger max_sectors_kb we will hit this issue on both mlx4 and mlx5, and for now we cannot sustain a 4MB max_sectors_kb with buffered I/O.

Laurence Oberman
Principal Software Maintenance Engineer
Red Hat Global Support Services



* Re: RDMA Read: Local protection error
From: Santosh Shilimkar @ 2016-04-29 19:07 UTC (permalink / raw)
  To: Chuck Lever; +Cc: linux-rdma



On 4/29/2016 9:58 AM, Chuck Lever wrote:
>
>> On Apr 29, 2016, at 12:44 PM, Santosh Shilimkar <santosh.shilimkar-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>
>>
>>
>> On 4/29/2016 9:24 AM, Chuck Lever wrote:
>>> I've found some new behavior, recently, while testing the
>>> v4.6-rc Linux NFS/RDMA client and server.
>>>
>>> When certain kernel memory debugging CONFIG options are
>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>> I want to see any problems, so I'm not sure which option
>>> in particular is exposing the issue.
>>>
>>> When debugging is enabled on the server, and the underlying
>>> device is using FRWR to register the sink buffer, an RDMA
>>> Read occasionally completes with LOC_PROT_ERR.
>>>
>>> When debugging is enabled on the client, and the underlying
>>> device uses FRWR to register the target of an RDMA Read, an
>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>> on the client completes with LOC_PROT_ERR.
>>>
>>> I do not see this problem when kernel memory debugging is
>>> disabled, or when the client is using FMR, or when the
>>> server is using physical addresses to post its RDMA Read WRs,
>>> or when wsize is 512KB or smaller.
>>>
>>> I have not found any obvious problems with the client logic
>>> that registers NFS WRITE buffers, nor the server logic that
>>> constructs and posts RDMA Read WRs.
>>>
>> One possibility here could be the mismatch in posted WR for
>> send/receive. Can you check if for certain cases you are
>> posting receive WRs which can't handle whats send is putting
>> on the wire.
>
> I've confirmed that the client is posting only 1024-byte
> Receive buffers, and that the ib_sge for each Receive
> operation is the same before and after the Receive is
> posted (ie, the Receive ib_sge is valid and is not
> getting overwritten somehow).
>
> The wire traffic contains Send Only requests of 230 or so
> bytes. If an ingress Send is too large, the Receive should
> complete with IB_WC_LOC_LEN_ERR, not IB_WC_LOC_PROT_ERR.
>
You are right. What I described is the IB_WC_LOC_LEN_ERR scenario,
not IB_WC_LOC_PROT_ERR.

Regards,
Santosh


* Re: RDMA Read: Local protection error
From: Chuck Lever @ 2016-05-02 15:10 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-rdma


> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>> I've found some new behavior, recently, while testing the
>> v4.6-rc Linux NFS/RDMA client and server.
>> 
>> When certain kernel memory debugging CONFIG options are
>> enabled, 1MB NFS WRITEs can sometimes result in a
>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>> I want to see any problems, so I'm not sure which option
>> in particular is exposing the issue.
>> 
>> When debugging is enabled on the server, and the underlying
>> device is using FRWR to register the sink buffer, an RDMA
>> Read occasionally completes with LOC_PROT_ERR.
>> 
>> When debugging is enabled on the client, and the underlying
>> device uses FRWR to register the target of an RDMA Read, an
>> ingress RDMA Read request sometimes gets a Syndrome 99
>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>> on the client completes with LOC_PROT_ERR.
>> 
>> I do not see this problem when kernel memory debugging is
>> disabled, or when the client is using FMR, or when the
>> server is using physical addresses to post its RDMA Read WRs,
>> or when wsize is 512KB or smaller.
>> 
>> I have not found any obvious problems with the client logic
>> that registers NFS WRITE buffers, nor the server logic that
>> constructs and posts RDMA Read WRs.
>> 
>> My next step is to bisect. But first, I was wondering if
>> this behavior might be related to the recent problems with
>> s/g lists seen with iSER/SRP? ie, is this a recognized
>> issue?
> 
> Hello Chuck,
> 
> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
> * Kernel 4.6.0-rc5 at the initiator side.
> * A whole bunch of kernel debugging options enabled at the initiator
>  side.
> * The following settings in /etc/modprobe.d/ib_srp.conf:
>  options ib_srp cmd_sg_entries=255 register_always=1
> * The following settings in /etc/srp_daemon.conf:
>  a queue_size=128,max_cmd_per_lun=128,max_sect=4096
> * Kernel 3.0.101 at the target side.
> * Kernel debugging disabled at the target side.
> * mlx4 driver at both sides.
> 
> Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.

git bisect result:

d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
commit d86bd1bece6fc41d59253002db5441fe960a37f6
Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
Date:   Tue Mar 15 14:55:12 2016 -0700

    mm/slub: support left redzone

I checked out the previous commit and was not able to
reproduce, which gives some confidence that the bisect
result is valid.

I've also investigated the wire behavior a little more.
The server I'm using for testing has FRWR artificially
disabled, so it uses physical addresses for RDMA Read.
This limits it to max_sge_rd, or 30 pages for each Read
request.

The client sends a single 1MB Read chunk. The server
emits 8 30-page Read requests, and a ninth request for
the last 16 pages in the chunk.
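(With 4KB pages, the 1MB chunk is 256 pages: 8 x 30 = 240 pages
in the full-sized requests, leaving 16 for the ninth.)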

The client's HCA responds to the 30-page Read requests
properly. But on the last Read request, it responds
with a Read First, 14 Read Middle responses, then an
ACK with Syndrome 99 (Remote Operation Error).
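(Assuming a 4KB path MTU, the 16-page request should take 16
response packets; a Read Response First plus 14 Middles accounts
for everything except the final packet, i.e. the last page.)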

This suggests the last page in the memory region is
not accessible to the HCA.

This does not happen on the first NFS WRITE, but rather
on one or two subsequent NFS WRITEs during the test.

--
Chuck Lever





* Re: RDMA Read: Local protection error
From: Bart Van Assche @ 2016-05-02 16:08 UTC (permalink / raw)
  To: Chuck Lever, Or Gerlitz; +Cc: linux-rdma

On 05/02/2016 08:10 AM, Chuck Lever wrote:
>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>> I've found some new behavior, recently, while testing the
>>> v4.6-rc Linux NFS/RDMA client and server.
>>>
>>> When certain kernel memory debugging CONFIG options are
>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>> I want to see any problems, so I'm not sure which option
>>> in particular is exposing the issue.
>>>
>>> When debugging is enabled on the server, and the underlying
>>> device is using FRWR to register the sink buffer, an RDMA
>>> Read occasionally completes with LOC_PROT_ERR.
>>>
>>> When debugging is enabled on the client, and the underlying
>>> device uses FRWR to register the target of an RDMA Read, an
>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>> on the client completes with LOC_PROT_ERR.
>>>
>>> I do not see this problem when kernel memory debugging is
>>> disabled, or when the client is using FMR, or when the
>>> server is using physical addresses to post its RDMA Read WRs,
>>> or when wsize is 512KB or smaller.
>>>
>>> I have not found any obvious problems with the client logic
>>> that registers NFS WRITE buffers, nor the server logic that
>>> constructs and posts RDMA Read WRs.
>>>
>>> My next step is to bisect. But first, I was wondering if
>>> this behavior might be related to the recent problems with
>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>> issue?
>>
>> Hello Chuck,
>>
>> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
>> * Kernel 4.6.0-rc5 at the initiator side.
>> * A whole bunch of kernel debugging options enabled at the initiator
>>   side.
>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>   options ib_srp cmd_sg_entries=255 register_always=1
>> * The following settings in /etc/srp_daemon.conf:
>>   a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>> * Kernel 3.0.101 at the target side.
>> * Kernel debugging disabled at the target side.
>> * mlx4 driver at both sides.
>>
>> Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.
>
> git bisect result:
>
> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
> commit d86bd1bece6fc41d59253002db5441fe960a37f6
> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
> Date:   Tue Mar 15 14:55:12 2016 -0700
>
>      mm/slub: support left redzone
>
> I checked out the previous commit and was not able to
> reproduce, which gives some confidence that the bisect
> result is valid.
>
> I've also investigated the wire behavior a little more.
> The server I'm using for testing has FRWR artificially
> disabled, so it uses physical addresses for RDMA Read.
> This limits it to max_sge_rd, or 30 pages for each Read
> request.
>
> The client sends a single 1MB Read chunk. The server
> emits 8 30-page Read requests, and a ninth request for
> the last 16 pages in the chunk.
>
> The client's HCA responds to the 30-page Read requests
> properly. But on the last Read request, it responds
> with a Read First, 14 Read Middle responses, then an
> ACK with Syndrome 99 (Remote Operation Error).
>
> This suggests the last page in the memory region is
> not accessible to the HCA.
>
> This does not happen on the first NFS WRITE, but
> rather one or two subsequent NFS WRITEs during the test.

On an x86 system that patch changes the alignment of buffers > 8 bytes 
from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN). 
There might be code in the mlx4 driver that makes incorrect assumptions 
about the alignment of memory allocated by kmalloc(). Can someone from 
Mellanox comment on the alignment requirements of the buffers allocated 
by mlx4_buf_alloc()?
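
To make the suspicion concrete, here is the kind of silent assumption
I have in mind (illustrative only, not existing mlx4 code): anything
that relies on kmalloc() returning memory aligned to more than
ARCH_KMALLOC_MINALIGN.

	static void *alloc_desc(size_t size)
	{
		void *buf = kmalloc(size, GFP_KERNEL);

		/* With SLUB left redzoning, buf may now be only 8-byte
		 * aligned; a hidden 16-byte alignment assumption breaks here.
		 */
		WARN_ON(buf && !IS_ALIGNED((unsigned long)buf, 16));
		return buf;
	}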

Thanks,

Bart.


* Re: RDMA Read: Local protection error
From: Chuck Lever @ 2016-05-03 14:57 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Bart Van Assche, Or Gerlitz, linux-rdma


> On May 2, 2016, at 12:08 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>> I've found some new behavior, recently, while testing the
>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>> 
>>>> When certain kernel memory debugging CONFIG options are
>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>> I want to see any problems, so I'm not sure which option
>>>> in particular is exposing the issue.
>>>> 
>>>> When debugging is enabled on the server, and the underlying
>>>> device is using FRWR to register the sink buffer, an RDMA
>>>> Read occasionally completes with LOC_PROT_ERR.
>>>> 
>>>> When debugging is enabled on the client, and the underlying
>>>> device uses FRWR to register the target of an RDMA Read, an
>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>> on the client completes with LOC_PROT_ERR.
>>>> 
>>>> I do not see this problem when kernel memory debugging is
>>>> disabled, or when the client is using FMR, or when the
>>>> server is using physical addresses to post its RDMA Read WRs,
>>>> or when wsize is 512KB or smaller.
>>>> 
>>>> I have not found any obvious problems with the client logic
>>>> that registers NFS WRITE buffers, nor the server logic that
>>>> constructs and posts RDMA Read WRs.
>>>> 
>>>> My next step is to bisect. But first, I was wondering if
>>>> this behavior might be related to the recent problems with
>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>> issue?
>>> 
>>> Hello Chuck,
>>> 
>>> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
>>> * Kernel 4.6.0-rc5 at the initiator side.
>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>  side.
>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>  options ib_srp cmd_sg_entries=255 register_always=1
>>> * The following settings in /etc/srp_daemon.conf:
>>>  a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>> * Kernel 3.0.101 at the target side.
>>> * Kernel debugging disabled at the target side.
>>> * mlx4 driver at both sides.
>>> 
>>> Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.
>> 
>> git bisect result:
>> 
>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>> Date:   Tue Mar 15 14:55:12 2016 -0700
>> 
>>     mm/slub: support left redzone
>> 
>> I checked out the previous commit and was not able to
>> reproduce, which gives some confidence that the bisect
>> result is valid.
>> 
>> I've also investigated the wire behavior a little more.
>> The server I'm using for testing has FRWR artificially
>> disabled, so it uses physical addresses for RDMA Read.
>> This limits it to max_sge_rd, or 30 pages for each Read
>> request.
>> 
>> The client sends a single 1MB Read chunk. The server
>> emits 8 30-page Read requests, and a ninth request for
>> the last 16 pages in the chunk.
>> 
>> The client's HCA responds to the 30-page Read requests
>> properly. But on the last Read request, it responds
>> with a Read First, 14 Read Middle responses, then an
>> ACK with Syndrome 99 (Remote Operation Error).
>> 
>> This suggests the last page in the memory region is
>> not accessible to the HCA.
>> 
>> This does not happen on the first NFS WRITE, but
>> rather one or two subsequent NFS WRITEs during the test.
> 
> On an x86 system that patch changes the alignment of buffers > 8 bytes from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN). There might be code in the mlx4 driver that makes incorrect assumptions about the alignment of memory allocated by kmalloc(). Can someone from Mellanox comment on the alignment requirements of the buffers allocated by mlx4_buf_alloc()?
> 
> Thanks,
> 
> Bart.

Let's also bring this to the attention of the patch's author.

Joonsoo, any ideas about how to track this down? There have
been several reports on linux-rdma of unexplained issues when
SLUB debugging is enabled.


--
Chuck Lever





* RE: RDMA Read: Local protection error
From: Joonsoo Kim @ 2016-05-04  1:07 UTC (permalink / raw)
  To: 'Chuck Lever'
  Cc: 'Bart Van Assche', 'Or Gerlitz',
	'linux-rdma',
	Joonsoo Kim



> -----Original Message-----
> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
> Sent: Tuesday, May 03, 2016 11:57 PM
> To: Joonsoo Kim
> Cc: Bart Van Assche; Or Gerlitz; linux-rdma
> Subject: Re: RDMA Read: Local protection error
> 
> 
> > On May 2, 2016, at 12:08 PM, Bart Van Assche
> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> >
> > On 05/02/2016 08:10 AM, Chuck Lever wrote:
> >>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche
> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> >>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
> >>>> I've found some new behavior, recently, while testing the
> >>>> v4.6-rc Linux NFS/RDMA client and server.
> >>>>
> >>>> When certain kernel memory debugging CONFIG options are
> >>>> enabled, 1MB NFS WRITEs can sometimes result in a
> >>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
> >>>> I want to see any problems, so I'm not sure which option
> >>>> in particular is exposing the issue.
> >>>>
> >>>> When debugging is enabled on the server, and the underlying
> >>>> device is using FRWR to register the sink buffer, an RDMA
> >>>> Read occasionally completes with LOC_PROT_ERR.
> >>>>
> >>>> When debugging is enabled on the client, and the underlying
> >>>> device uses FRWR to register the target of an RDMA Read, an
> >>>> ingress RDMA Read request sometimes gets a Syndrome 99
> >>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
> >>>> on the client completes with LOC_PROT_ERR.
> >>>>
> >>>> I do not see this problem when kernel memory debugging is
> >>>> disabled, or when the client is using FMR, or when the
> >>>> server is using physical addresses to post its RDMA Read WRs,
> >>>> or when wsize is 512KB or smaller.
> >>>>
> >>>> I have not found any obvious problems with the client logic
> >>>> that registers NFS WRITE buffers, nor the server logic that
> >>>> constructs and posts RDMA Read WRs.
> >>>>
> >>>> My next step is to bisect. But first, I was wondering if
> >>>> this behavior might be related to the recent problems with
> >>>> s/g lists seen with iSER/SRP? ie, is this a recognized
> >>>> issue?
> >>>
> >>> Hello Chuck,
> >>>
> >>> A few days ago I observed similar behavior with the SRP protocol but
> only if I increase max_sect in /etc/srp_daemon.conf from the default to
> 4096. My setup was as follows:
> >>> * Kernel 4.6.0-rc5 at the initiator side.
> >>> * A whole bunch of kernel debugging options enabled at the initiator
> >>>  side.
> >>> * The following settings in /etc/modprobe.d/ib_srp.conf:
> >>>  options ib_srp cmd_sg_entries=255 register_always=1
> >>> * The following settings in /etc/srp_daemon.conf:
> >>>  a queue_size=128,max_cmd_per_lun=128,max_sect=4096
> >>> * Kernel 3.0.101 at the target side.
> >>> * Kernel debugging disabled at the target side.
> >>> * mlx4 driver at both sides.
> >>>
> >>> Decreasing max_sge at the target side from 32 to 16 did not help. I
> have not yet had the time to analyze this further.
> >>
> >> git bisect result:
> >>
> >> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
> >> commit d86bd1bece6fc41d59253002db5441fe960a37f6
> >> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
> >> Date:   Tue Mar 15 14:55:12 2016 -0700
> >>
> >>     mm/slub: support left redzone
> >>
> >> I checked out the previous commit and was not able to
> >> reproduce, which gives some confidence that the bisect
> >> result is valid.
> >>
> >> I've also investigated the wire behavior a little more.
> >> The server I'm using for testing has FRWR artificially
> >> disabled, so it uses physical addresses for RDMA Read.
> >> This limits it to max_sge_rd, or 30 pages for each Read
> >> request.
> >>
> >> The client sends a single 1MB Read chunk. The server
> >> emits 8 30-page Read requests, and a ninth request for
> >> the last 16 pages in the chunk.
> >>
> >> The client's HCA responds to the 30-page Read requests
> >> properly. But on the last Read request, it responds
> >> with a Read First, 14 Read Middle responses, then an
> >> ACK with Syndrome 99 (Remote Operation Error).
> >>
> >> This suggests the last page in the memory region is
> >> not accessible to the HCA.
> >>
> >> This does not happen on the first NFS WRITE, but
> >> rather one or two subsequent NFS WRITEs during the test.
> >
> > On an x86 system that patch changes the alignment of buffers > 8 bytes
> from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN).
> There might be code in the mlx4 driver that makes incorrect assumptions
> about the alignment of memory allocated by kmalloc(). Can someone from
> Mellanox comment on the alignment requirements of the buffers allocated by
> mlx4_buf_alloc()?
> >
> > Thanks,
> >
> > Bart.
> 
> Let's also bring this to the attention of the patch's author.
> 
> Joonsoo, any ideas about how to track this down? There have
> been several reports on linux-rdma of unexplained issues when
> SLUB debugging is enabled.

(Adding another e-mail address on CC, because I will not be in
the office for a few days.)

Hello,

Hmm... we need to test whether the root cause is really alignment or not.
Could you test the change below? It makes the alignment of (kmalloc'ed)
buffers 16 bytes when the debug option is enabled. If it solves the issue,
someone's alignment assumption is wrong and should be fixed at that site.
If not, the patch itself would be the cause of the problem. In that case,
I will look at it more.

Thanks.

-------------->8--------------
diff --git a/mm/slub.c b/mm/slub.c
index f41360e..6f9783c 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -3322,9 +3322,10 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
                 */
                size += sizeof(void *);
 
-               s->red_left_pad = sizeof(void *);
+               s->red_left_pad = sizeof(void *) * 2;
                s->red_left_pad = ALIGN(s->red_left_pad, s->align);
                size += s->red_left_pad;
+               size = ALIGN(size, 16);
        }
 #endif
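
(On 64-bit this widens red_left_pad to two pointers, i.e. 16 bytes,
and rounds the total object size up to a multiple of 16, so each
debugged object should start on a 16-byte boundary again.)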






* Re: RDMA Read: Local protection error
From: Chuck Lever @ 2016-05-04 19:59 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Bart Van Assche, Or Gerlitz, linux-rdma, Joonsoo Kim


> On May 3, 2016, at 9:07 PM, Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org> wrote:
> 
> 
> 
>> -----Original Message-----
>> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
>> Sent: Tuesday, May 03, 2016 11:57 PM
>> To: Joonsoo Kim
>> Cc: Bart Van Assche; Or Gerlitz; linux-rdma
>> Subject: Re: RDMA Read: Local protection error
>> 
>> 
>>> On May 2, 2016, at 12:08 PM, Bart Van Assche
>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>> 
>>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche
>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>>> I've found some new behavior, recently, while testing the
>>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>>> 
>>>>>> When certain kernel memory debugging CONFIG options are
>>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>>> I want to see any problems, so I'm not sure which option
>>>>>> in particular is exposing the issue.
>>>>>> 
>>>>>> When debugging is enabled on the server, and the underlying
>>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>>> 
>>>>>> When debugging is enabled on the client, and the underlying
>>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>>> on the client completes with LOC_PROT_ERR.
>>>>>> 
>>>>>> I do not see this problem when kernel memory debugging is
>>>>>> disabled, or when the client is using FMR, or when the
>>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>>> or when wsize is 512KB or smaller.
>>>>>> 
>>>>>> I have not found any obvious problems with the client logic
>>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>>> constructs and posts RDMA Read WRs.
>>>>>> 
>>>>>> My next step is to bisect. But first, I was wondering if
>>>>>> this behavior might be related to the recent problems with
>>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>>> issue?
>>>>> 
>>>>> Hello Chuck,
>>>>> 
>>>>> A few days ago I observed similar behavior with the SRP protocol but
>> only if I increase max_sect in /etc/srp_daemon.conf from the default to
>> 4096. My setup was as follows:
>>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>>> side.
>>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>>> * The following settings in /etc/srp_daemon.conf:
>>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>>> * Kernel 3.0.101 at the target side.
>>>>> * Kernel debugging disabled at the target side.
>>>>> * mlx4 driver at both sides.
>>>>> 
>>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I
>> have not yet had the time to analyze this further.
>>>> 
>>>> git bisect result:
>>>> 
>>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>>> 
>>>>    mm/slub: support left redzone
>>>> 
>>>> I checked out the previous commit and was not able to
>>>> reproduce, which gives some confidence that the bisect
>>>> result is valid.
>>>> 
>>>> I've also investigated the wire behavior a little more.
>>>> The server I'm using for testing has FRWR artificially
>>>> disabled, so it uses physical addresses for RDMA Read.
>>>> This limits it to max_sge_rd, or 30 pages for each Read
>>>> request.
>>>> 
>>>> The client sends a single 1MB Read chunk. The server
>>>> emits 8 30-page Read requests, and a ninth request for
>>>> the last 16 pages in the chunk.
>>>> 
>>>> The client's HCA responds to the 30-page Read requests
>>>> properly. But on the last Read request, it responds
>>>> with a Read First, 14 Read Middle responses, then an
>>>> ACK with Syndrome 99 (Remote Operation Error).
>>>> 
>>>> This suggests the last page in the memory region is
>>>> not accessible to the HCA.
>>>> 
>>>> This does not happen on the first NFS WRITE, but
>>>> rather one or two subsequent NFS WRITEs during the test.
>>> 
>>> On an x86 system that patch changes the alignment of buffers > 8 bytes
>> from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN).
>> There might be code in the mlx4 driver that makes incorrect assumptions
>> about the alignment of memory allocated by kmalloc(). Can someone from
>> Mellanox comment on the alignment requirements of the buffers allocated by
>> mlx4_buf_alloc()?
>>> 
>>> Thanks,
>>> 
>>> Bart.
>> 
>> Let's also bring this to the attention of the patch's author.
>> 
>> Joonsoo, any ideas about how to track this down? There have
>> been several reports on linux-rdma of unexplained issues when
>> SLUB debugging is enabled.
> 
> (Adding another e-mail address on CC, because I will not be in
> The office for a few days.)
> 
> Hello,
> 
> Hmm... we need to test if root cause is really alignment or not.
> Could you test below change? It will make alignment of (kmalloce) buffer
> to 16 bytes when debug option is enabled. If it will solve the issue,
> someone's alignment assumption is wrong and should be fixed at that site.
> If not, patch itself would be cause of the problem. In that case, I will
> look at it more.
> 
> Thanks.
> 
> -------------->8--------------
> diff --git a/mm/slub.c b/mm/slub.c
> index f41360e..6f9783c 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -3322,9 +3322,10 @@ static int calculate_sizes(struct kmem_cache *s, int
> forced_order)
>                 */
>                size += sizeof(void *);
> 
> -               s->red_left_pad = sizeof(void *);
> +               s->red_left_pad = sizeof(void *) * 2;
>                s->red_left_pad = ALIGN(s->red_left_pad, s->align);
>                size += s->red_left_pad;
> +               size = ALIGN(size, 16);
>        }
> #endif

I applied this patch and enabled SLUB debugging.
I was able to reproduce the "local protection error".


--
Chuck Lever





* Re: RDMA Read: Local protection error
From: Joonsoo Kim @ 2016-05-09  1:03 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Joonsoo Kim, Bart Van Assche, Or Gerlitz, linux-rdma

2016-05-05 4:59 GMT+09:00 Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>:
>
>> On May 3, 2016, at 9:07 PM, Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org> wrote:
>>
>>
>>
>>> -----Original Message-----
>>> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
>>> Sent: Tuesday, May 03, 2016 11:57 PM
>>> To: Joonsoo Kim
>>> Cc: Bart Van Assche; Or Gerlitz; linux-rdma
>>> Subject: Re: RDMA Read: Local protection error
>>>
>>>
>>>> On May 2, 2016, at 12:08 PM, Bart Van Assche
>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>
>>>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche
>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>>>> I've found some new behavior, recently, while testing the
>>>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>>>>
>>>>>>> When certain kernel memory debugging CONFIG options are
>>>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>>>> I want to see any problems, so I'm not sure which option
>>>>>>> in particular is exposing the issue.
>>>>>>>
>>>>>>> When debugging is enabled on the server, and the underlying
>>>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>>>>
>>>>>>> When debugging is enabled on the client, and the underlying
>>>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>>>> on the client completes with LOC_PROT_ERR.
>>>>>>>
>>>>>>> I do not see this problem when kernel memory debugging is
>>>>>>> disabled, or when the client is using FMR, or when the
>>>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>>>> or when wsize is 512KB or smaller.
>>>>>>>
>>>>>>> I have not found any obvious problems with the client logic
>>>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>>>> constructs and posts RDMA Read WRs.
>>>>>>>
>>>>>>> My next step is to bisect. But first, I was wondering if
>>>>>>> this behavior might be related to the recent problems with
>>>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>>>> issue?
>>>>>>
>>>>>> Hello Chuck,
>>>>>>
>>>>>> A few days ago I observed similar behavior with the SRP protocol but
>>> only if I increase max_sect in /etc/srp_daemon.conf from the default to
>>> 4096. My setup was as follows:
>>>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>>>> side.
>>>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>>>> * The following settings in /etc/srp_daemon.conf:
>>>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>>>> * Kernel 3.0.101 at the target side.
>>>>>> * Kernel debugging disabled at the target side.
>>>>>> * mlx4 driver at both sides.
>>>>>>
>>>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I
>>> have not yet had the time to analyze this further.
>>>>>
>>>>> git bisect result:
>>>>>
>>>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>>>>
>>>>>    mm/slub: support left redzone
>>>>>
>>>>> I checked out the previous commit and was not able to
>>>>> reproduce, which gives some confidence that the bisect
>>>>> result is valid.
>>>>>
>>>>> I've also investigated the wire behavior a little more.
>>>>> The server I'm using for testing has FRWR artificially
>>>>> disabled, so it uses physical addresses for RDMA Read.
>>>>> This limits it to max_sge_rd, or 30 pages for each Read
>>>>> request.
>>>>>
>>>>> The client sends a single 1MB Read chunk. The server
>>>>> emits 8 30-page Read requests, and a ninth request for
>>>>> the last 16 pages in the chunk.
>>>>>
>>>>> The client's HCA responds to the 30-page Read requests
>>>>> properly. But on the last Read request, it responds
>>>>> with a Read First, 14 Read Middle responses, then an
>>>>> ACK with Syndrome 99 (Remote Operation Error).
>>>>>
>>>>> This suggests the last page in the memory region is
>>>>> not accessible to the HCA.
>>>>>
>>>>> This does not happen on the first NFS WRITE, but
>>>>> rather one or two subsequent NFS WRITEs during the test.
>>>>
>>>> On an x86 system that patch changes the alignment of buffers > 8 bytes
>>> from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN).
>>> There might be code in the mlx4 driver that makes incorrect assumptions
>>> about the alignment of memory allocated by kmalloc(). Can someone from
>>> Mellanox comment on the alignment requirements of the buffers allocated by
>>> mlx4_buf_alloc()?
>>>>
>>>> Thanks,
>>>>
>>>> Bart.
>>>
>>> Let's also bring this to the attention of the patch's author.
>>>
>>> Joonsoo, any ideas about how to track this down? There have
>>> been several reports on linux-rdma of unexplained issues when
>>> SLUB debugging is enabled.
>>
>> (Adding another e-mail address on CC, because I will not be in
>> The office for a few days.)
>>
>> Hello,
>>
>> Hmm... we need to test if root cause is really alignment or not.
>> Could you test below change? It will make alignment of (kmalloce) buffer
>> to 16 bytes when debug option is enabled. If it will solve the issue,
>> someone's alignment assumption is wrong and should be fixed at that site.
>> If not, patch itself would be cause of the problem. In that case, I will
>> look at it more.
>>
>> Thanks.
>>
>> -------------->8--------------
>> diff --git a/mm/slub.c b/mm/slub.c
>> index f41360e..6f9783c 100644
>> --- a/mm/slub.c
>> +++ b/mm/slub.c
>> @@ -3322,9 +3322,10 @@ static int calculate_sizes(struct kmem_cache *s, int
>> forced_order)
>>                 */
>>                size += sizeof(void *);
>>
>> -               s->red_left_pad = sizeof(void *);
>> +               s->red_left_pad = sizeof(void *) * 2;
>>                s->red_left_pad = ALIGN(s->red_left_pad, s->align);
>>                size += s->red_left_pad;
>> +               size = ALIGN(size, 16);
>>        }
>> #endif
>
> I applied this patch and enabled SLUB debugging.
> I was able to reproduce the "local protection error".

I finally found one reporting problem when KASAN finds an error,
but it would not be related to your problem.

I have no idea why your problem happens now. Do you have
a reproducer for the problem? I'd like to regenerate the error
on my side.

If a reproducer isn't available, I'm okay with reverting that patch.

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                           ` <CAAmzW4NbY3Og0BgQyeA4LLXTnMuPTjxVUdFbH+HLahBw+MAhsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2016-05-09  1:15                             ` Chuck Lever
       [not found]                               ` <1A79DEDE-A5C3-4581-A0AE-7C0AB056B4C7-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2016-05-09  1:15 UTC (permalink / raw)
  To: Joonsoo Kim; +Cc: Joonsoo Kim, Bart Van Assche, Or Gerlitz, linux-rdma


> On May 8, 2016, at 9:03 PM, Joonsoo Kim <js1304-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> 
> 2016-05-05 4:59 GMT+09:00 Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>:
>> 
>>> On May 3, 2016, at 9:07 PM, Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org> wrote:
>>> 
>>> 
>>> 
>>>> -----Original Message-----
>>>> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
>>>> Sent: Tuesday, May 03, 2016 11:57 PM
>>>> To: Joonsoo Kim
>>>> Cc: Bart Van Assche; Or Gerlitz; linux-rdma
>>>> Subject: Re: RDMA Read: Local protection error
>>>> 
>>>> 
>>>>> On May 2, 2016, at 12:08 PM, Bart Van Assche
>>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>> 
>>>>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche
>>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>>>>> I've found some new behavior, recently, while testing the
>>>>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>>>>> 
>>>>>>>> When certain kernel memory debugging CONFIG options are
>>>>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>>>>> I want to see any problems, so I'm not sure which option
>>>>>>>> in particular is exposing the issue.
>>>>>>>> 
>>>>>>>> When debugging is enabled on the server, and the underlying
>>>>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>>>>> 
>>>>>>>> When debugging is enabled on the client, and the underlying
>>>>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>>>>> on the client completes with LOC_PROT_ERR.
>>>>>>>> 
>>>>>>>> I do not see this problem when kernel memory debugging is
>>>>>>>> disabled, or when the client is using FMR, or when the
>>>>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>>>>> or when wsize is 512KB or smaller.
>>>>>>>> 
>>>>>>>> I have not found any obvious problems with the client logic
>>>>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>>>>> constructs and posts RDMA Read WRs.
>>>>>>>> 
>>>>>>>> My next step is to bisect. But first, I was wondering if
>>>>>>>> this behavior might be related to the recent problems with
>>>>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>>>>> issue?
>>>>>>> 
>>>>>>> Hello Chuck,
>>>>>>> 
>>>>>>> A few days ago I observed similar behavior with the SRP protocol but
>>>> only if I increase max_sect in /etc/srp_daemon.conf from the default to
>>>> 4096. My setup was as follows:
>>>>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>>>>> side.
>>>>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>>>>> * The following settings in /etc/srp_daemon.conf:
>>>>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>>>>> * Kernel 3.0.101 at the target side.
>>>>>>> * Kernel debugging disabled at the target side.
>>>>>>> * mlx4 driver at both sides.
>>>>>>> 
>>>>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I
>>>> have not yet had the time to analyze this further.
>>>>>> 
>>>>>> git bisect result:
>>>>>> 
>>>>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>>>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>>>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>>>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>>>>> 
>>>>>>   mm/slub: support left redzone
>>>>>> 
>>>>>> I checked out the previous commit and was not able to
>>>>>> reproduce, which gives some confidence that the bisect
>>>>>> result is valid.
>>>>>> 
>>>>>> I've also investigated the wire behavior a little more.
>>>>>> The server I'm using for testing has FRWR artificially
>>>>>> disabled, so it uses physical addresses for RDMA Read.
>>>>>> This limits it to max_sge_rd, or 30 pages for each Read
>>>>>> request.
>>>>>> 
>>>>>> The client sends a single 1MB Read chunk. The server
>>>>>> emits 8 30-page Read requests, and a ninth request for
>>>>>> the last 16 pages in the chunk.
>>>>>> 
>>>>>> The client's HCA responds to the 30-page Read requests
>>>>>> properly. But on the last Read request, it responds
>>>>>> with a Read First, 14 Read Middle responses, then an
>>>>>> ACK with Syndrome 99 (Remote Operation Error).
>>>>>> 
>>>>>> This suggests the last page in the memory region is
>>>>>> not accessible to the HCA.
>>>>>> 
>>>>>> This does not happen on the first NFS WRITE, but
>>>>>> rather one or two subsequent NFS WRITEs during the test.
>>>>> 
>>>>> On an x86 system that patch changes the alignment of buffers > 8 bytes
>>>> from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN).
>>>> There might be code in the mlx4 driver that makes incorrect assumptions
>>>> about the alignment of memory allocated by kmalloc(). Can someone from
>>>> Mellanox comment on the alignment requirements of the buffers allocated by
>>>> mlx4_buf_alloc()?
>>>>> 
>>>>> Thanks,
>>>>> 
>>>>> Bart.
>>>> 
>>>> Let's also bring this to the attention of the patch's author.
>>>> 
>>>> Joonsoo, any ideas about how to track this down? There have
>>>> been several reports on linux-rdma of unexplained issues when
>>>> SLUB debugging is enabled.
>>> 
>>> (Adding another e-mail address on CC, because I will not be in
>>> The office for a few days.)
>>> 
>>> Hello,
>>> 
>>> Hmm... we need to test if root cause is really alignment or not.
>>> Could you test below change? It will make alignment of (kmalloce) buffer
>>> to 16 bytes when debug option is enabled. If it will solve the issue,
>>> someone's alignment assumption is wrong and should be fixed at that site.
>>> If not, patch itself would be cause of the problem. In that case, I will
>>> look at it more.
>>> 
>>> Thanks.
>>> 
>>> -------------->8--------------
>>> diff --git a/mm/slub.c b/mm/slub.c
>>> index f41360e..6f9783c 100644
>>> --- a/mm/slub.c
>>> +++ b/mm/slub.c
>>> @@ -3322,9 +3322,10 @@ static int calculate_sizes(struct kmem_cache *s, int
>>> forced_order)
>>>                */
>>>               size += sizeof(void *);
>>> 
>>> -               s->red_left_pad = sizeof(void *);
>>> +               s->red_left_pad = sizeof(void *) * 2;
>>>               s->red_left_pad = ALIGN(s->red_left_pad, s->align);
>>>               size += s->red_left_pad;
>>> +               size = ALIGN(size, 16);
>>>       }
>>> #endif
>> 
>> I applied this patch and enabled SLUB debugging.
>> I was able to reproduce the "local protection error".
> 
> I finally found one reporting problem when KASAN find an error
> but it would not be related to your problem.
> 
> I have no idea why your problem happens now. Do you have
> any reproducer of the problem? I'd like to regenerate an error
> on my side.
> 
> If reproducer isn't available, I'm okay to revert that patch.

I have a reproducer, but it requires an NFS/RDMA setup.
I know that's less convenient, but if you can give me some
direction, maybe the problem can be narrowed down further.


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                               ` <1A79DEDE-A5C3-4581-A0AE-7C0AB056B4C7-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-05-09  2:11                                 ` Joonsoo Kim
  0 siblings, 0 replies; 34+ messages in thread
From: Joonsoo Kim @ 2016-05-09  2:11 UTC (permalink / raw)
  To: Chuck Lever; +Cc: Joonsoo Kim, Bart Van Assche, Or Gerlitz, linux-rdma

2016-05-09 10:15 GMT+09:00 Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>:
>
>> On May 8, 2016, at 9:03 PM, Joonsoo Kim <js1304-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>>
>> 2016-05-05 4:59 GMT+09:00 Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>:
>>>
>>>> On May 3, 2016, at 9:07 PM, Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org> wrote:
>>>>
>>>>
>>>>
>>>>> -----Original Message-----
>>>>> From: Chuck Lever [mailto:chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org]
>>>>> Sent: Tuesday, May 03, 2016 11:57 PM
>>>>> To: Joonsoo Kim
>>>>> Cc: Bart Van Assche; Or Gerlitz; linux-rdma
>>>>> Subject: Re: RDMA Read: Local protection error
>>>>>
>>>>>
>>>>>> On May 2, 2016, at 12:08 PM, Bart Van Assche
>>>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>>
>>>>>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>>>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche
>>>>> <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>>>>>> I've found some new behavior, recently, while testing the
>>>>>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>>>>>>
>>>>>>>>> When certain kernel memory debugging CONFIG options are
>>>>>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>>>>>> I want to see any problems, so I'm not sure which option
>>>>>>>>> in particular is exposing the issue.
>>>>>>>>>
>>>>>>>>> When debugging is enabled on the server, and the underlying
>>>>>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>>>>>>
>>>>>>>>> When debugging is enabled on the client, and the underlying
>>>>>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>>>>>> on the client completes with LOC_PROT_ERR.
>>>>>>>>>
>>>>>>>>> I do not see this problem when kernel memory debugging is
>>>>>>>>> disabled, or when the client is using FMR, or when the
>>>>>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>>>>>> or when wsize is 512KB or smaller.
>>>>>>>>>
>>>>>>>>> I have not found any obvious problems with the client logic
>>>>>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>>>>>> constructs and posts RDMA Read WRs.
>>>>>>>>>
>>>>>>>>> My next step is to bisect. But first, I was wondering if
>>>>>>>>> this behavior might be related to the recent problems with
>>>>>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>>>>>> issue?
>>>>>>>>
>>>>>>>> Hello Chuck,
>>>>>>>>
>>>>>>>> A few days ago I observed similar behavior with the SRP protocol but
>>>>> only if I increase max_sect in /etc/srp_daemon.conf from the default to
>>>>> 4096. My setup was as follows:
>>>>>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>>>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>>>>>> side.
>>>>>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>>>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>>>>>> * The following settings in /etc/srp_daemon.conf:
>>>>>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>>>>>> * Kernel 3.0.101 at the target side.
>>>>>>>> * Kernel debugging disabled at the target side.
>>>>>>>> * mlx4 driver at both sides.
>>>>>>>>
>>>>>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I
>>>>> have not yet had the time to analyze this further.
>>>>>>>
>>>>>>> git bisect result:
>>>>>>>
>>>>>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>>>>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>>>>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>>>>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>>>>>>
>>>>>>>   mm/slub: support left redzone
>>>>>>>
>>>>>>> I checked out the previous commit and was not able to
>>>>>>> reproduce, which gives some confidence that the bisect
>>>>>>> result is valid.
>>>>>>>
>>>>>>> I've also investigated the wire behavior a little more.
>>>>>>> The server I'm using for testing has FRWR artificially
>>>>>>> disabled, so it uses physical addresses for RDMA Read.
>>>>>>> This limits it to max_sge_rd, or 30 pages for each Read
>>>>>>> request.
>>>>>>>
>>>>>>> The client sends a single 1MB Read chunk. The server
>>>>>>> emits 8 30-page Read requests, and a ninth request for
>>>>>>> the last 16 pages in the chunk.
>>>>>>>
>>>>>>> The client's HCA responds to the 30-page Read requests
>>>>>>> properly. But on the last Read request, it responds
>>>>>>> with a Read First, 14 Read Middle responses, then an
>>>>>>> ACK with Syndrome 99 (Remote Operation Error).
>>>>>>>
>>>>>>> This suggests the last page in the memory region is
>>>>>>> not accessible to the HCA.
>>>>>>>
>>>>>>> This does not happen on the first NFS WRITE, but
>>>>>>> rather one or two subsequent NFS WRITEs during the test.
>>>>>>
>>>>>> On an x86 system that patch changes the alignment of buffers > 8 bytes
>>>>> from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN).
>>>>> There might be code in the mlx4 driver that makes incorrect assumptions
>>>>> about the alignment of memory allocated by kmalloc(). Can someone from
>>>>> Mellanox comment on the alignment requirements of the buffers allocated by
>>>>> mlx4_buf_alloc()?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Bart.
>>>>>
>>>>> Let's also bring this to the attention of the patch's author.
>>>>>
>>>>> Joonsoo, any ideas about how to track this down? There have
>>>>> been several reports on linux-rdma of unexplained issues when
>>>>> SLUB debugging is enabled.
>>>>
>>>> (Adding another e-mail address on CC, because I will not be in
>>>> The office for a few days.)
>>>>
>>>> Hello,
>>>>
>>>> Hmm... we need to test if root cause is really alignment or not.
>>>> Could you test below change? It will make alignment of (kmalloce) buffer
>>>> to 16 bytes when debug option is enabled. If it will solve the issue,
>>>> someone's alignment assumption is wrong and should be fixed at that site.
>>>> If not, patch itself would be cause of the problem. In that case, I will
>>>> look at it more.
>>>>
>>>> Thanks.
>>>>
>>>> -------------->8--------------
>>>> diff --git a/mm/slub.c b/mm/slub.c
>>>> index f41360e..6f9783c 100644
>>>> --- a/mm/slub.c
>>>> +++ b/mm/slub.c
>>>> @@ -3322,9 +3322,10 @@ static int calculate_sizes(struct kmem_cache *s, int
>>>> forced_order)
>>>>                */
>>>>               size += sizeof(void *);
>>>>
>>>> -               s->red_left_pad = sizeof(void *);
>>>> +               s->red_left_pad = sizeof(void *) * 2;
>>>>               s->red_left_pad = ALIGN(s->red_left_pad, s->align);
>>>>               size += s->red_left_pad;
>>>> +               size = ALIGN(size, 16);
>>>>       }
>>>> #endif
>>>
>>> I applied this patch and enabled SLUB debugging.
>>> I was able to reproduce the "local protection error".
>>
>> I finally found one reporting problem when KASAN find an error
>> but it would not be related to your problem.
>>
>> I have no idea why your problem happens now. Do you have
>> any reproducer of the problem? I'd like to regenerate an error
>> on my side.
>>
>> If reproducer isn't available, I'm okay to revert that patch.
>
> I have a reproducer, but it requires an NFS/RDMA set up.
> I know it's less optimal, but if you can give me some
> direction maybe the problem can be narrowed further.

Okay! Let's try it. Thanks in advance for your help.

First, I'd like to check whether the cause of the problem is the
object layout or not.
Please apply the patch below and run the reproducer with "slub_debug=z",
and then with "slub_debug=zx". Please let me know the result for the
local protection error and the output of "dmesg | grep KMEM_CACHE".

If the problem doesn't happen with "slub_debug=zx", please test with
"slub_debug=zxf".

Also, please let me know the kmem_cache information from the previous
kernel (with my patch reverted). You can use the following printk:

printk("KMEM_CACHE: %20.20s 0x%8lx %8d %8d %8d %8d %8d %8d\n",
       s->name, s->flags, s->size, s->object_size, s->offset, s->inuse,
       s->align, s->reserved);


Thanks.


----->8-----------
diff --git a/mm/slub.c b/mm/slub.c
index f41360e..98988d6 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -39,6 +39,8 @@

 #include "internal.h"

+#define SLAB_NO_RED_LEFT 0x10000000UL
+
 /*
  * Lock order:
  *   1. slab_mutex (Global Mutex)
@@ -1230,6 +1232,9 @@ static int __init setup_slub_debug(char *str)
                case 'a':
                        slub_debug |= SLAB_FAILSLAB;
                        break;
+               case 'x':
+                       slub_debug |= SLAB_NO_RED_LEFT;
+                       break;
                case 'o':
                        /*
                         * Avoid enabling debugging on caches if its minimum
@@ -3320,11 +3325,11 @@ static int calculate_sizes(struct kmem_cache *s, int forced_order)
                 * corrupted if a user writes before the start
                 * of the object.
                 */
-               size += sizeof(void *);
-
-               s->red_left_pad = sizeof(void *);
-               s->red_left_pad = ALIGN(s->red_left_pad, s->align);
-               size += s->red_left_pad;
+               size += ALIGN(sizeof(void *), s->align);
+               if (flags & SLAB_NO_RED_LEFT)
+                       s->red_left_pad = 0;
+               else
+                       s->red_left_pad = ALIGN(sizeof(void *), s->align);
        }
 #endif

@@ -4001,6 +4006,8 @@ int __kmem_cache_create(struct kmem_cache *s, unsigned long flags)
        if (err)
                return err;

+       printk("KMEM_CACHE: %20.20s 0x%8lx %8d %8d %8d %8d %8d %8d\n", s->name, s->flags, s->size, s->object_size, s->offset, s->inuse, s->align, s->reserved);
+
        /* Mutex is not taken during early boot */
        if (slab_state <= UP)
                return 0;
-- 
1.9.1
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                 ` <6BBFD126-877C-4638-BB91-ABF715E29326-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2016-05-04  1:07                   ` Joonsoo Kim
@ 2016-05-25 15:58                   ` Chuck Lever
       [not found]                     ` <1AFD636B-09FC-4736-B1C5-D1D9FA0B97B0-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2016-05-25 15:58 UTC (permalink / raw)
  To: Yishai Hadas; +Cc: linux-rdma, Bart Van Assche, Or Gerlitz, Joonsoo Kim

Hello Yishai-

Reporting an mlx4 IB driver bug below. Sorry for the
length.


> On May 3, 2016, at 10:57 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> 
> 
>> On May 2, 2016, at 12:08 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>> 
>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>> I've found some new behavior, recently, while testing the
>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>> 
>>>>> When certain kernel memory debugging CONFIG options are
>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>> I want to see any problems, so I'm not sure which option
>>>>> in particular is exposing the issue.
>>>>> 
>>>>> When debugging is enabled on the server, and the underlying
>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>> 
>>>>> When debugging is enabled on the client, and the underlying
>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>> on the client completes with LOC_PROT_ERR.
>>>>> 
>>>>> I do not see this problem when kernel memory debugging is
>>>>> disabled, or when the client is using FMR, or when the
>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>> or when wsize is 512KB or smaller.
>>>>> 
>>>>> I have not found any obvious problems with the client logic
>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>> constructs and posts RDMA Read WRs.
>>>>> 
>>>>> My next step is to bisect. But first, I was wondering if
>>>>> this behavior might be related to the recent problems with
>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>> issue?
>>>> 
>>>> Hello Chuck,
>>>> 
>>>> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>> side.
>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>> * The following settings in /etc/srp_daemon.conf:
>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>> * Kernel 3.0.101 at the target side.
>>>> * Kernel debugging disabled at the target side.
>>>> * mlx4 driver at both sides.
>>>> 
>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.
>>> 
>>> git bisect result:
>>> 
>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>> 
>>>    mm/slub: support left redzone
>>> 
>>> I checked out the previous commit and was not able to
>>> reproduce, which gives some confidence that the bisect
>>> result is valid.
>>> 
>>> I've also investigated the wire behavior a little more.
>>> The server I'm using for testing has FRWR artificially
>>> disabled, so it uses physical addresses for RDMA Read.
>>> This limits it to max_sge_rd, or 30 pages for each Read
>>> request.
>>> 
>>> The client sends a single 1MB Read chunk. The server
>>> emits 8 30-page Read requests, and a ninth request for
>>> the last 16 pages in the chunk.
>>> 
>>> The client's HCA responds to the 30-page Read requests
>>> properly. But on the last Read request, it responds
>>> with a Read First, 14 Read Middle responses, then an
>>> ACK with Syndrome 99 (Remote Operation Error).
>>> 
>>> This suggests the last page in the memory region is
>>> not accessible to the HCA.
>>> 
>>> This does not happen on the first NFS WRITE, but
>>> rather one or two subsequent NFS WRITEs during the test.
>> 
>> On an x86 system that patch changes the alignment of buffers > 8 bytes from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN). There might be code in the mlx4 driver that makes incorrect assumptions about the alignment of memory allocated by kmalloc(). Can someone from Mellanox comment on the alignment requirements of the buffers allocated by mlx4_buf_alloc()?
>> 
>> Thanks,
>> 
>> Bart.
> 
> Let's also bring this to the attention of the patch's author.
> 
> Joonsoo, any ideas about how to track this down? There have
> been several reports on linux-rdma of unexplained issues when
> SLUB debugging is enabled.

Joonsoo and I tracked this down.

The original problem report was Read and Receive WRs
completing with Local Protection Error when SLUB
debugging was enabled.

We found that the problem occurred only when debugging
was enabled for the kmalloc-4096 slab.

A kmalloc tracepoint log shows one likely mlx4 call
site that uses the kmalloc-4096 slab with NFS.

kworker/u25:0-10565 [005]  5300.132063: kmalloc:              (mlx4_ib_alloc_mr+0xb8) [FAILED TO PARSE] call_site=0xffffffffa0294048 ptr=0xffff88043d808008     bytes_req=2112 bytes_alloc=4432 gfp_flags=37781696

So let's look at mlx4_ib_alloc_mr().

The call to kzalloc() at the top of this function is for
size 136, so that's not the one in this trace log entry.

However, later in mlx4_alloc_priv_pages(), there is
a kzalloc of the right size. NFS will call ib_alloc_mr
with just over 256 sg's, which gives us a 2112 byte
allocation request.
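
For reference, the relevant part of mlx4_alloc_priv_pages() looks
roughly like this (a sketch, not a verbatim copy; the constant values
MLX4_MR_PAGES_ALIGN == 64 and ARCH_KMALLOC_MINALIGN == 8 are my
assumptions for this x86 setup, inferred from the addresses below):

   int size = max_pages * sizeof(u64);            /* 257 * 8 = 2056 */
   int add_size = max_t(int, MLX4_MR_PAGES_ALIGN -
                             ARCH_KMALLOC_MINALIGN, 0);  /* 64 - 8 = 56 */

   mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL); /* 2112 bytes */
   mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);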

I added some pr_err() calls in this function to have
a look at the addresses returned by kmalloc.

When debugging is disabled, kzalloc returns page-aligned
addresses:

<mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88046a692000
<mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88046a692000
<mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88046a693000
<mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88046a693000

When debugging is enabled, the addresses are not aligned:

<mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88044f141158
<mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88044f141180
<mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88044f145698
<mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88044f1456c0

Now and then we see one like this (from a later run):

<mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2312, pages_alloc=ffff880462f2e7e8
<mlx4_ib> mlx4_alloc_priv_pages: size = 2064, pages=ffff880462f2e800

(Size 2312 is because I changed the code slightly for
this later run).

Notice that the address is aligned to the half-page.
But the array we're trying to fit into this allocation
is 2064 bytes. That means the memory allocation, and
thus the array, crosses a page boundary.

See what mlx4_alloc_priv_pages does with this memory
allocation:

  mr->page_map = dma_map_single(device->dma_device, mr->pages,
                                size, DMA_TO_DEVICE);

dma_map_single() expects the mr->pages allocation to fit
on a single page, as far as I can tell.

The requested allocation size here is 2312 bytes. kmalloc
returns ffff880462f2e7e8, which leaves 2072 bytes on one page
(0x1000 - 0x7e8 = 0x818 = 2072) and puts the remaining 240 bytes
on another.

So it seems like the bug is that mlx4_alloc_priv_pages
assumes that a "small" kmalloc allocation will never
hit a page boundary.

That assumption is understandable: mlx4 allows up to 511
sge's; the array size would be just under 4096 bytes
for that number of sge's. The function is careful to
deal with the alignment of the array start. But it
is not careful about where the array ends.
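
To illustrate, the missing check amounts to something like this
(a hypothetical helper, not code from the driver):

   #include <linux/mm.h>      /* PAGE_SHIFT */
   #include <linux/types.h>

   /* does [pages, pages + size) straddle a 4KB page boundary? */
   static bool pages_array_crosses_page(void *pages, size_t size)
   {
           unsigned long start = (unsigned long)pages;
           unsigned long end = start + size - 1;

           return (start >> PAGE_SHIFT) != (end >> PAGE_SHIFT);
   }

For the allocation above it would return true: ffff880462f2e800 plus
2064 bytes runs past the boundary at ffff880462f2f000.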

NB: mlx5_alloc_priv_descs() may have the same issue?

I partially tested this theory by having
mlx4_alloc_priv_pages request 8192 bytes for that
allocation. That avoids the kmalloc-4096 slab, which
still has debugging enabled.

kmalloc returns a page-aligned piece of memory, the
array fits within a single page, and the Local
Protection Error does not occur.

Joonsoo also tested the theory this way:

a) object size for kmalloc-4096 is 4424

SLUB debugging adds 39 * sizeof(u64) to real allocation
size

mr->pages_alloc object's start offset, end offset (within the slab,
starting at 0), and whether the array crosses a page boundary:
size: 4424
0 2112 no-cross
4424 6536 no-cross
8848 10960 no-cross
13272 15384 no-cross
17696 19808 no-cross
22120 24232 no-cross
26544 28656 no-cross

No cross into another page, no Local Protection Error

b) object size for kmalloc-4096 is 4432

SLUB debugging adds 40 * sizeof(u64) to real allocation
size

size: 4432
0 2112 no-cross
4432 6544 no-cross
8864 10976 no-cross
13296 15408 no-cross
17728 19840 no-cross
22160 24272 no-cross
26592 28704 cross

Page boundary cross, Local Protection Error occurs
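
For anyone who wants to redo this arithmetic, a small stand-alone C
program reproduces both tables (the 4096-byte page size and 2112-byte
array size are taken from the discussion above; everything else is
just illustration):

   #include <stdio.h>

   int main(void)
   {
           const int page = 4096, array = 2112;
           const int objsize[] = { 4424, 4432 };

           for (int i = 0; i < 2; i++) {
                   printf("size: %d\n", objsize[i]);
                   for (int obj = 0; obj < 7; obj++) {
                           int start = obj * objsize[i];
                           int end = start + array;
                           /* does the array cross into the next page? */
                           int cross = (start / page) != ((end - 1) / page);
                           printf("%d %d %s\n", start, end,
                                  cross ? "cross" : "no-cross");
                   }
           }
           return 0;
   }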


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                     ` <1AFD636B-09FC-4736-B1C5-D1D9FA0B97B0-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-05-26 16:24                       ` Yishai Hadas
       [not found]                         ` <8a3276bf-f716-3dca-9d54-369fc3bdcc39-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Yishai Hadas @ 2016-05-26 16:24 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Yishai Hadas, linux-rdma, Bart Van Assche, Or Gerlitz,
	Joonsoo Kim, Haggai Eran, Majd Dibbiny

On 5/25/2016 6:58 PM, Chuck Lever wrote:
> Hello Yishai-
>
> Reporting an mlx4 IB driver bug below. Sorry for the
> length.
>
>
>> On May 3, 2016, at 10:57 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
>>
>>
>>> On May 2, 2016, at 12:08 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>
>>> On 05/02/2016 08:10 AM, Chuck Lever wrote:
>>>>> On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
>>>>> On 04/29/2016 09:24 AM, Chuck Lever wrote:
>>>>>> I've found some new behavior, recently, while testing the
>>>>>> v4.6-rc Linux NFS/RDMA client and server.
>>>>>>
>>>>>> When certain kernel memory debugging CONFIG options are
>>>>>> enabled, 1MB NFS WRITEs can sometimes result in a
>>>>>> IB_WC_LOC_PROT_ERR. I usually turn on most of them because
>>>>>> I want to see any problems, so I'm not sure which option
>>>>>> in particular is exposing the issue.
>>>>>>
>>>>>> When debugging is enabled on the server, and the underlying
>>>>>> device is using FRWR to register the sink buffer, an RDMA
>>>>>> Read occasionally completes with LOC_PROT_ERR.
>>>>>>
>>>>>> When debugging is enabled on the client, and the underlying
>>>>>> device uses FRWR to register the target of an RDMA Read, an
>>>>>> ingress RDMA Read request sometimes gets a Syndrome 99
>>>>>> (REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
>>>>>> on the client completes with LOC_PROT_ERR.
>>>>>>
>>>>>> I do not see this problem when kernel memory debugging is
>>>>>> disabled, or when the client is using FMR, or when the
>>>>>> server is using physical addresses to post its RDMA Read WRs,
>>>>>> or when wsize is 512KB or smaller.
>>>>>>
>>>>>> I have not found any obvious problems with the client logic
>>>>>> that registers NFS WRITE buffers, nor the server logic that
>>>>>> constructs and posts RDMA Read WRs.
>>>>>>
>>>>>> My next step is to bisect. But first, I was wondering if
>>>>>> this behavior might be related to the recent problems with
>>>>>> s/g lists seen with iSER/SRP? ie, is this a recognized
>>>>>> issue?
>>>>>
>>>>> Hello Chuck,
>>>>>
>>>>> A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
>>>>> * Kernel 4.6.0-rc5 at the initiator side.
>>>>> * A whole bunch of kernel debugging options enabled at the initiator
>>>>> side.
>>>>> * The following settings in /etc/modprobe.d/ib_srp.conf:
>>>>> options ib_srp cmd_sg_entries=255 register_always=1
>>>>> * The following settings in /etc/srp_daemon.conf:
>>>>> a queue_size=128,max_cmd_per_lun=128,max_sect=4096
>>>>> * Kernel 3.0.101 at the target side.
>>>>> * Kernel debugging disabled at the target side.
>>>>> * mlx4 driver at both sides.
>>>>>
>>>>> Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.
>>>>
>>>> git bisect result:
>>>>
>>>> d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
>>>> commit d86bd1bece6fc41d59253002db5441fe960a37f6
>>>> Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
>>>> Date:   Tue Mar 15 14:55:12 2016 -0700
>>>>
>>>>    mm/slub: support left redzone
>>>>
>>>> I checked out the previous commit and was not able to
>>>> reproduce, which gives some confidence that the bisect
>>>> result is valid.
>>>>
>>>> I've also investigated the wire behavior a little more.
>>>> The server I'm using for testing has FRWR artificially
>>>> disabled, so it uses physical addresses for RDMA Read.
>>>> This limits it to max_sge_rd, or 30 pages for each Read
>>>> request.
>>>>
>>>> The client sends a single 1MB Read chunk. The server
>>>> emits 8 30-page Read requests, and a ninth request for
>>>> the last 16 pages in the chunk.
>>>>
>>>> The client's HCA responds to the 30-page Read requests
>>>> properly. But on the last Read request, it responds
>>>> with a Read First, 14 Read Middle responses, then an
>>>> ACK with Syndrome 99 (Remote Operation Error).
>>>>
>>>> This suggests the last page in the memory region is
>>>> not accessible to the HCA.
>>>>
>>>> This does not happen on the first NFS WRITE, but
>>>> rather one or two subsequent NFS WRITEs during the test.
>>>
>>> On an x86 system that patch changes the alignment of buffers > 8 bytes from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN). There might be code in the mlx4 driver that makes incorrect assumptions about the alignment of memory allocated by kmalloc(). Can someone from Mellanox comment on the alignment requirements of the buffers allocated by mlx4_buf_alloc()?
>>>
>>> Thanks,
>>>
>>> Bart.
>>
>> Let's also bring this to the attention of the patch's author.
>>
>> Joonsoo, any ideas about how to track this down? There have
>> been several reports on linux-rdma of unexplained issues when
>> SLUB debugging is enabled.
>
> Joonsoo and I tracked this down.
>
> The original problem report was Read and Receive WRs
> completing with Local Protection Error when SLUB
> debugging was enabled.
>
> We found that the problem occurred only when debugging
> was enabled for the kmalloc-4096 slab.
>
> A kmalloc tracepoint log shows one likely mlx4 call
> site that uses the kmalloc-4096 slab with NFS.
>
> kworker/u25:0-10565 [005]  5300.132063: kmalloc:              (mlx4_ib_alloc_mr+0xb8) [FAILED TO PARSE] call_site=0xffffffffa0294048 ptr=0xffff88043d808008     bytes_req=2112 bytes_alloc=4432 gfp_flags=37781696
>
> So let's look at mlx4_ib_alloc_mr().
>
> The call to kzalloc() at the top of this function is for
> size 136, so that's not the one in this trace log entry.
>
> However, later in mlx4_alloc_priv_pages(), there is
> a kzalloc of the right size. NFS will call ib_alloc_mr
> with just over 256 sg's, which gives us a 2112 byte
> allocation request.
>
> I added some pr_err() calls in this function to have
> a look at the addresses returned by kmalloc.
>
> When debugging is disabled, kzalloc returns page-aligned
> addresses:

Is it defined somewhere that regular kzalloc/kmalloc guarantees to
return a page-aligned address, as you see in your testing? If so, the
debug mode should behave the same. Otherwise we can consider using an
allocation flag that can force that, if such a flag exists.
Let's get other people's input here.

> <mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88046a692000
> <mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88046a692000
> <mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88046a693000
> <mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88046a693000
>
> When debugging is enabled, the addresses are not aligned:
>
> <mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88044f141158
> <mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88044f141180
> <mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2112, pages_alloc=ffff88044f145698
> <mlx4_ib> mlx4_alloc_priv_pages: size = 2056, pages=ffff88044f1456c0
>
> Now and then we see one like this (from a later run):
>
> <mlx4_ib> mlx4_alloc_priv_pages: size + add_size = 2312, pages_alloc=ffff880462f2e7e8
> <mlx4_ib> mlx4_alloc_priv_pages: size = 2064, pages=ffff880462f2e800
>
> (Size 2312 is because I changed the code slightly for
> this later run).
>
> Notice that the address is aligned to the half-page.
> But the array we're trying to fit into this allocation
> is 2064 bytes. That means the memory allocation, and
> thus the array, crosses a page boundary.
>
> See what mlx4_alloc_priv_pages does with this memory
> allocation:
>
>   mr->page_map = dma_map_single(device->dma_device, mr->pages,
>                                 size, DMA_TO_DEVICE);
>
> dma_map_single() expects the mr->pages allocation to fit
> on a single page, as far as I can tell.

Couldn't we expect the underlying call to fail in that case? It gets
both the address and the size.

> The requested allocation size here is 2312 bytes. kmalloc
> returns ffff880462f2e7e8, which is 2072 bytes on one
> page, and 240 bytes on another.
>
> So it seems like the bug is that mlx4_alloc_priv_pages
> assumes that a "small" kmalloc allocation will never
> hit a page boundary.


> That assumption is understandable: mlx4 allows up to 511
> sge's; the array size would be just under 4096 bytes
> for that number of sge's. The function is careful to
> deal with the alignment of the array start. But it
> is not careful about where the array ends.
>
> NB: mlx5_alloc_priv_descs() may have the same issue?

Yes, it looks like the same behavior.

> I partially tested this theory by having
> mlx4_alloc_priv_pages request 8192 bytes for that
> allocation. That avoids the kmalloc-4096 slab, which
> still has debugging enabled.
>
> kmalloc returns a page-aligned piece of memory, the
> array fits within a single page, and the Local
> Protection Error does not occur.
>
> Joonsoo also tested the theory this way:
>
> a) object size for kmalloc-4096 is 4424
>
> SLUB debugging adds 39 * sizeof(u64) to real allocation
> size
>
> mr->pages_alloc object's start, end offset, page # (starting on 0)
> size: 4424
> 0 2112 no-cross
> 4424 6536 no-cross
> 8848 10960 no-cross
> 13272 15384 no-cross
> 17696 19808 no-cross
> 22120 24232 no-cross
> 26544 28656 no-cross
>
> No cross into another page, no Local Protection Error
>
> b) object size for kmalloc-4096 is 4432
>
> SLUB debugging adds 40 * sizeof(u64) to real allocation
> size
>
> size: 4432
> 0 2112 no-cross
> 4432 6544 no-cross
> 8864 10976 no-cross
> 13296 15408 no-cross
> 17728 19840 no-cross
> 22160 24272 no-cross
> 26592 28704 cross
>
> Page boundary cross, Local Protection Error occurs
>
>
> --
> Chuck Lever
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                         ` <8a3276bf-f716-3dca-9d54-369fc3bdcc39-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2016-05-26 16:30                           ` Bart Van Assche
       [not found]                             ` <aaa67d51-663a-0aba-fc54-a5ab5d947a55-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2016-05-26 16:39                           ` Leon Romanovsky
  1 sibling, 1 reply; 34+ messages in thread
From: Bart Van Assche @ 2016-05-26 16:30 UTC (permalink / raw)
  To: Yishai Hadas, Chuck Lever
  Cc: Yishai Hadas, linux-rdma, Bart Van Assche, Or Gerlitz,
	Joonsoo Kim, Haggai Eran, Majd Dibbiny

On 05/26/2016 09:24 AM, Yishai Hadas wrote:
> On 5/25/2016 6:58 PM, Chuck Lever wrote:
>> When debugging is disabled, kzalloc returns page-aligned
>> addresses:
>
> Is it defined some where that regular kzalloc/kmalloc guaranties to
> return a page-aligned address as you see in your testing ? if so the
> debug mode should behave the same. Otherwise we can consider using any
> flag allocation that can force that if such exists.
> Let's get other people's input here.

My understanding is that the fact that k[mz]alloc() returns a 
page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a side 
effect of the implementation and not something callers of that function 
should rely on. I think the only assumption k[mz]alloc() callers should 
rely on is that the allocated memory respects ARCH_KMALLOC_MINALIGN.

Bart.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                             ` <aaa67d51-663a-0aba-fc54-a5ab5d947a55-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
@ 2016-05-26 16:34                               ` Chuck Lever
       [not found]                                 ` <C0AE237D-5E5A-4F94-B717-F3A3B4B4D4A8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2016-05-26 20:10                               ` Christoph Lameter
  1 sibling, 1 reply; 34+ messages in thread
From: Chuck Lever @ 2016-05-26 16:34 UTC (permalink / raw)
  To: Bart Van Assche, Yishai Hadas
  Cc: Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny


> On May 26, 2016, at 12:30 PM, Bart Van Assche <bart.vanassche-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org> wrote:
> 
> On 05/26/2016 09:24 AM, Yishai Hadas wrote:
>> On 5/25/2016 6:58 PM, Chuck Lever wrote:
>>> When debugging is disabled, kzalloc returns page-aligned
>>> addresses:
>> 
>> Is it defined some where that regular kzalloc/kmalloc guaranties to
>> return a page-aligned address as you see in your testing ? if so the
>> debug mode should behave the same. Otherwise we can consider using any
>> flag allocation that can force that if such exists.
>> Let's get other people's input here.
> 
> My understanding is that the fact that k[mz]alloc() returns a page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a side effect of the implementation and not something callers of that function should rely on. I think the only assumption k[mz]alloc() callers should rely on is that the allocated memory respects ARCH_KMALLOC_MINALIGN.

I agree. mlx4_alloc_priv_pages() is carefully designed to
correct the alignment of the buffer, so it already assumes
that it is not getting a page-aligned buffer.

The alignment isn't the problem here, though. It's that
the buffer contains a page boundary. That is guaranteed
to be the case for HCAs that support more than 512
sges (512 eight-byte entries already fill a 4096-byte
page), so that will have to be addressed (at least in
mlx5).


--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                         ` <8a3276bf-f716-3dca-9d54-369fc3bdcc39-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2016-05-26 16:30                           ` Bart Van Assche
@ 2016-05-26 16:39                           ` Leon Romanovsky
  1 sibling, 0 replies; 34+ messages in thread
From: Leon Romanovsky @ 2016-05-26 16:39 UTC (permalink / raw)
  To: Yishai Hadas
  Cc: Chuck Lever, Yishai Hadas, linux-rdma, Bart Van Assche,
	Or Gerlitz, Joonsoo Kim, Haggai Eran, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 6831 bytes --]

On Thu, May 26, 2016 at 07:24:29PM +0300, Yishai Hadas wrote:
> On 5/25/2016 6:58 PM, Chuck Lever wrote:
> >Hello Yishai-
> >
> >Reporting an mlx4 IB driver bug below. Sorry for the
> >length.
> >
> >
> >>On May 3, 2016, at 10:57 AM, Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> wrote:
> >>
> >>
> >>>On May 2, 2016, at 12:08 PM, Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >>>
> >>>On 05/02/2016 08:10 AM, Chuck Lever wrote:
> >>>>>On Apr 29, 2016, at 12:45 PM, Bart Van Assche <bart.vanassche@sandisk.com> wrote:
> >>>>>On 04/29/2016 09:24 AM, Chuck Lever wrote:
> >>>>>>I've found some new behavior, recently, while testing the
> >>>>>>v4.6-rc Linux NFS/RDMA client and server.
> >>>>>>
> >>>>>>When certain kernel memory debugging CONFIG options are
> >>>>>>enabled, 1MB NFS WRITEs can sometimes result in a
> >>>>>>IB_WC_LOC_PROT_ERR. I usually turn on most of them because
> >>>>>>I want to see any problems, so I'm not sure which option
> >>>>>>in particular is exposing the issue.
> >>>>>>
> >>>>>>When debugging is enabled on the server, and the underlying
> >>>>>>device is using FRWR to register the sink buffer, an RDMA
> >>>>>>Read occasionally completes with LOC_PROT_ERR.
> >>>>>>
> >>>>>>When debugging is enabled on the client, and the underlying
> >>>>>>device uses FRWR to register the target of an RDMA Read, an
> >>>>>>ingress RDMA Read request sometimes gets a Syndrome 99
> >>>>>>(REM_OP_ERR) acknowledgement, and a subsequent RDMA Receive
> >>>>>>on the client completes with LOC_PROT_ERR.
> >>>>>>
> >>>>>>I do not see this problem when kernel memory debugging is
> >>>>>>disabled, or when the client is using FMR, or when the
> >>>>>>server is using physical addresses to post its RDMA Read WRs,
> >>>>>>or when wsize is 512KB or smaller.
> >>>>>>
> >>>>>>I have not found any obvious problems with the client logic
> >>>>>>that registers NFS WRITE buffers, nor the server logic that
> >>>>>>constructs and posts RDMA Read WRs.
> >>>>>>
> >>>>>>My next step is to bisect. But first, I was wondering if
> >>>>>>this behavior might be related to the recent problems with
> >>>>>>s/g lists seen with iSER/SRP? ie, is this a recognized
> >>>>>>issue?
> >>>>>
> >>>>>Hello Chuck,
> >>>>>
> >>>>>A few days ago I observed similar behavior with the SRP protocol but only if I increase max_sect in /etc/srp_daemon.conf from the default to 4096. My setup was as follows:
> >>>>>* Kernel 4.6.0-rc5 at the initiator side.
> >>>>>* A whole bunch of kernel debugging options enabled at the initiator
> >>>>>side.
> >>>>>* The following settings in /etc/modprobe.d/ib_srp.conf:
> >>>>>options ib_srp cmd_sg_entries=255 register_always=1
> >>>>>* The following settings in /etc/srp_daemon.conf:
> >>>>>a queue_size=128,max_cmd_per_lun=128,max_sect=4096
> >>>>>* Kernel 3.0.101 at the target side.
> >>>>>* Kernel debugging disabled at the target side.
> >>>>>* mlx4 driver at both sides.
> >>>>>
> >>>>>Decreasing max_sge at the target side from 32 to 16 did not help. I have not yet had the time to analyze this further.
> >>>>
> >>>>git bisect result:
> >>>>
> >>>>d86bd1bece6fc41d59253002db5441fe960a37f6 is the first bad commit
> >>>>commit d86bd1bece6fc41d59253002db5441fe960a37f6
> >>>>Author: Joonsoo Kim <iamjoonsoo.kim-Hm3cg6mZ9cc@public.gmane.org>
> >>>>Date:   Tue Mar 15 14:55:12 2016 -0700
> >>>>
> >>>>   mm/slub: support left redzone
> >>>>
> >>>>I checked out the previous commit and was not able to
> >>>>reproduce, which gives some confidence that the bisect
> >>>>result is valid.
> >>>>
> >>>>I've also investigated the wire behavior a little more.
> >>>>The server I'm using for testing has FRWR artificially
> >>>>disabled, so it uses physical addresses for RDMA Read.
> >>>>This limits it to max_sge_rd, or 30 pages for each Read
> >>>>request.
> >>>>
> >>>>The client sends a single 1MB Read chunk. The server
> >>>>emits 8 30-page Read requests, and a ninth request for
> >>>>the last 16 pages in the chunk.
> >>>>
> >>>>The client's HCA responds to the 30-page Read requests
> >>>>properly. But on the last Read request, it responds
> >>>>with a Read First, 14 Read Middle responses, then an
> >>>>ACK with Syndrome 99 (Remote Operation Error).
> >>>>
> >>>>This suggests the last page in the memory region is
> >>>>not accessible to the HCA.
> >>>>
> >>>>This does not happen on the first NFS WRITE, but
> >>>>rather one or two subsequent NFS WRITEs during the test.
> >>>
> >>>On an x86 system that patch changes the alignment of buffers > 8 bytes from 16 bytes to 8 bytes (ARCH_SLAB_MINALIGN / ARCH_KMALLOC_MINALIGN). There might be code in the mlx4 driver that makes incorrect assumptions about the alignment of memory allocated by kmalloc(). Can someone from Mellanox comment on the alignment requirements of the buffers allocated by mlx4_buf_alloc()?
> >>>
> >>>Thanks,
> >>>
> >>>Bart.
> >>
> >>Let's also bring this to the attention of the patch's author.
> >>
> >>Joonsoo, any ideas about how to track this down? There have
> >>been several reports on linux-rdma of unexplained issues when
> >>SLUB debugging is enabled.
> >
> >Joonsoo and I tracked this down.
> >
> >The original problem report was Read and Receive WRs
> >completing with Local Protection Error when SLUB
> >debugging was enabled.
> >
> >We found that the problem occurred only when debugging
> >was enabled for the kmalloc-4096 slab.
> >
> >A kmalloc tracepoint log shows one likely mlx4 call
> >site that uses the kmalloc-4096 slab with NFS.
> >
> >kworker/u25:0-10565 [005]  5300.132063: kmalloc:              (mlx4_ib_alloc_mr+0xb8) [FAILED TO PARSE] call_site=0xffffffffa0294048 ptr=0xffff88043d808008     bytes_req=2112 bytes_alloc=4432 gfp_flags=37781696
> >
> >So let's look at mlx4_ib_alloc_mr().
> >
> >The call to kzalloc() at the top of this function is for
> >size 136, so that's not the one in this trace log entry.
> >
> >However, later in mlx4_alloc_priv_pages(), there is
> >a kzalloc of the right size. NFS will call ib_alloc_mr
> >with just over 256 sg's, which gives us a 2112 byte
> >allocation request.
> >
> >I added some pr_err() calls in this function to have
> >a look at the addresses returned by kmalloc.
> >
> >When debugging is disabled, kzalloc returns page-aligned
> >addresses:
> 
> Is it defined some where that regular kzalloc/kmalloc guaranties to return a
> page-aligned address as you see in your testing ? if so the debug mode
> should behave the same. Otherwise we can consider using any flag allocation
> that can force that if such exists.
> Let's get other people's input here.

No, kmalloc()/kzalloc() don't guarantee page alignment. You should use
get_free_pages() for page-aligned allocations.
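
For what it's worth, a minimal sketch of that approach (using
__get_free_pages(); the function names here are illustrative, not the
driver's):

   #include <linux/gfp.h>
   #include <linux/mm.h>

   static u64 *alloc_page_list(int max_pages, int *order)
   {
           size_t size = max_pages * sizeof(u64);

           /* order 0 (a single page) for anything up to 4096 bytes,
            * so the array can never straddle a page boundary */
           *order = get_order(size);
           return (u64 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO, *order);
   }

   static void free_page_list(u64 *pages, int order)
   {
           free_pages((unsigned long)pages, order);
   }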

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                 ` <C0AE237D-5E5A-4F94-B717-F3A3B4B4D4A8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
@ 2016-05-26 16:48                                   ` Sagi Grimberg
       [not found]                                     ` <574728EC.9040802-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2016-05-26 16:48 UTC (permalink / raw)
  To: Chuck Lever, Bart Van Assche, Yishai Hadas
  Cc: Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny


>>>> When debugging is disabled, kzalloc returns page-aligned
>>>> addresses:
>>>
>>> Is it defined some where that regular kzalloc/kmalloc guaranties to
>>> return a page-aligned address as you see in your testing ? if so the
>>> debug mode should behave the same. Otherwise we can consider using any
>>> flag allocation that can force that if such exists.
>>> Let's get other people's input here.
>>
>> My understanding is that the fact that k[mz]alloc() returns a page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a side effect of the implementation and not something callers of that function should rely on. I think the only assumption k[mz]alloc() callers should rely on is that the allocated memory respects ARCH_KMALLOC_MINALIGN.
>
> I agree. mlx4_alloc_priv_pages() is carefully designed to
> correct the alignment of the buffer, so it already assumes
> that it is not getting a page-aligned buffer.
>
> The alignment isn't the problem here, though. It's that
> the buffer contains a page-boundary. That is guaranteed
> to be the case for HCAs that support more than 512
> sges, so that will have to be addressed (at least in
> mlx5).

rrr...

I think we should make the pages allocation DMA coherent
in order to fix that...

Nice catch, Chuck.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                     ` <574728EC.9040802-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-05-26 17:19                                       ` Sagi Grimberg
       [not found]                                         ` <57473025.5020801-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2016-05-26 17:19 UTC (permalink / raw)
  To: Chuck Lever, Bart Van Assche, Yishai Hadas
  Cc: Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny


>>>>> When debugging is disabled, kzalloc returns page-aligned
>>>>> addresses:
>>>>
>>>> Is it defined some where that regular kzalloc/kmalloc guaranties to
>>>> return a page-aligned address as you see in your testing ? if so the
>>>> debug mode should behave the same. Otherwise we can consider using any
>>>> flag allocation that can force that if such exists.
>>>> Let's get other people's input here.
>>>
>>> My understanding is that the fact that k[mz]alloc() returns a
>>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
>>> side effect of the implementation and not something callers of that
>>> function should rely on. I think the only assumption k[mz]alloc()
>>> callers should rely on is that the allocated memory respects
>>> ARCH_KMALLOC_MINALIGN.
>>
>> I agree. mlx4_alloc_priv_pages() is carefully designed to
>> correct the alignment of the buffer, so it already assumes
>> that it is not getting a page-aligned buffer.
>>
>> The alignment isn't the problem here, though. It's that
>> the buffer contains a page-boundary. That is guaranteed
>> to be the case for HCAs that support more than 512
>> sges, so that will have to be addressed (at least in
>> mlx5).
>
> rrr...
>
> I think we should make the pages allocations dma coherent
> in order to fix that...
>
> Nice catch Chunk.

Does this untested patch help (if so, mlx5 will need an identical patch)?

--
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ba328177eae9..78e9b3addfea 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -139,7 +139,6 @@ struct mlx4_ib_mr {
         u32                     max_pages;
         struct mlx4_mr          mmr;
         struct ib_umem         *umem;
-       void                    *pages_alloc;
  };

  struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index b04f6238e7e2..becb4a65c755 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
                       int max_pages)
  {
         int size = max_pages * sizeof(u64);
-       int add_size;
-       int ret;
-
-       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);

-       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-       if (!mr->pages_alloc)
+       mr->pages = dma_alloc_coherent(device->dma_device, size,
+                               &mr->page_map, GFP_KERNEL);
+       if (!mr->pages)
                 return -ENOMEM;

-       mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
-       mr->page_map = dma_map_single(device->dma_device, mr->pages,
-                                     size, DMA_TO_DEVICE);
-
-       if (dma_mapping_error(device->dma_device, mr->page_map)) {
-               ret = -ENOMEM;
-               goto err;
-       }
-
         return 0;
-err:
-       kfree(mr->pages_alloc);
-
-       return ret;
  }

  static void
@@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
                 struct ib_device *device = mr->ibmr.device;
                 int size = mr->max_pages * sizeof(u64);

-               dma_unmap_single(device->dma_device, mr->page_map,
-                                size, DMA_TO_DEVICE);
-               kfree(mr->pages_alloc);
+               dma_free_coherent(device->dma_device, size,
+                               mr->pages, mr->page_map);
                 mr->pages = NULL;
         }
  }
@@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
                 unsigned int sg_offset)
  {
         struct mlx4_ib_mr *mr = to_mmr(ibmr);
-       int rc;

         mr->npages = 0;

-       ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
-                                  sizeof(u64) * mr->max_pages,
-                                  DMA_TO_DEVICE);
-
-       rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
-
-       ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
-                                     sizeof(u64) * mr->max_pages,
-                                     DMA_TO_DEVICE);
-
-       return rc;
+       return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
  }
--
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                         ` <57473025.5020801-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-05-26 17:57                                           ` Chuck Lever
  2016-05-26 19:23                                           ` Leon Romanovsky
  1 sibling, 0 replies; 34+ messages in thread
From: Chuck Lever @ 2016-05-26 17:57 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Bart Van Assche, Yishai Hadas, Yishai Hadas, linux-rdma,
	Or Gerlitz, Joonsoo Kim, Haggai Eran, Majd Dibbiny


> On May 26, 2016, at 1:19 PM, Sagi Grimberg <sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org> wrote:
> 
>>>>>> 
>>>>>> When debugging is disabled, kzalloc returns page-aligned
>>>>>> addresses:
>>>>> 
>>>>> Is it defined somewhere that regular kzalloc/kmalloc guarantees to
>>>>> return a page-aligned address as you see in your testing? If so, the
>>>>> debug mode should behave the same. Otherwise we can consider using an
>>>>> allocation flag that can force that, if such a flag exists.
>>>>> Let's get other people's input here.
>>>> 
>>>> My understanding is that the fact that k[mz]alloc() returns a
>>>> page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
>>>> side effect of the implementation and not something callers of that
>>>> function should rely on. I think the only assumption k[mz]alloc()
>>>> callers should rely on is that the allocated memory respects
>>>> ARCH_KMALLOC_MINALIGN.
>>> 
>>> I agree. mlx4_alloc_priv_pages() is carefully designed to
>>> correct the alignment of the buffer, so it already assumes
>>> that it is not getting a page-aligned buffer.
>>> 
>>> The alignment isn't the problem here, though. It's that
>>> the buffer contains a page-boundary. That is guaranteed
>>> to be the case for HCAs that support more than 512
>>> sges, so that will have to be addressed (at least in
>>> mlx5).
>> 
>> rrr...
>> 
>> I think we should make the pages allocations dma coherent
>> in order to fix that...
>> 
>> Nice catch, Chuck.
> 
> Does this untested patch help (if so, mlx5 will need an identical patch)?

Thanks, Sagi.

Is it safe?

Yes, IPoIB and NFS/RDMA work after the patch is applied.

Is it effective?

I booted with slub_debug=zfpu, and I am not able to reproduce
the Local Protection Error WC flushes.
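
For reference, my understanding of the slub_debug letters (they are
case-insensitive, so zfpu should be the same as ZFPU):

  Z - red zoning      (pads objects, which is what defeats the incidental
                       page alignment of large kmalloc allocations)
  F - sanity checks
  P - poisoning
  U - user/owner tracking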


However, it's not clear whether that's because DMA-coherent
memory is not allocated out of a slab cache, and thus it is
not subject to SLUB debugging. <shrug>

To test it more thoroughly, mlx4_alloc_priv_pages() could
allocate a two-page buffer and push the address of the
array up far enough that it would cross the page boundary.
I didn't try that.
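
A rough sketch of what I mean (untested, illustration only; the function
name and the offset math are made up here, the idea is just to
over-allocate and start the array 64 bytes before the next page boundary
so it is guaranteed to straddle it):

static int
mlx4_alloc_priv_pages_boundary_test(struct ib_device *device,
				    struct mlx4_ib_mr *mr, int max_pages)
{
	int size = max_pages * sizeof(u64);
	unsigned long addr;

	/* one extra page so the shifted array still fits */
	mr->pages_alloc = kzalloc(size + PAGE_SIZE, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* place mr->pages 64 bytes before the next page boundary */
	addr = (unsigned long)mr->pages_alloc;
	mr->pages = (__be64 *)(PAGE_ALIGN(addr + 64) - 64);

	mr->page_map = dma_map_single(device->dma_device, mr->pages,
				      size, DMA_TO_DEVICE);
	if (dma_mapping_error(device->dma_device, mr->page_map)) {
		kfree(mr->pages_alloc);
		return -ENOMEM;
	}
	return 0;
}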

(Also, your patch can delete the definition of
MLX4_MR_PAGES_ALIGN)

Tested-by: Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>


> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>        u32                     max_pages;
>        struct mlx4_mr          mmr;
>        struct ib_umem         *umem;
> -       void                    *pages_alloc;
> };
> 
> struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>                      int max_pages)
> {
>        int size = max_pages * sizeof(u64);
> -       int add_size;
> -       int ret;
> -
> -       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> 
> -       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -       if (!mr->pages_alloc)
> +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> +                               &mr->page_map, GFP_KERNEL);
> +       if (!mr->pages)
>                return -ENOMEM;
> 
> -       mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
> -
> -       mr->page_map = dma_map_single(device->dma_device, mr->pages,
> -                                     size, DMA_TO_DEVICE);
> -
> -       if (dma_mapping_error(device->dma_device, mr->page_map)) {
> -               ret = -ENOMEM;
> -               goto err;
> -       }
> -
>        return 0;
> -err:
> -       kfree(mr->pages_alloc);
> -
> -       return ret;
> }
> 
> static void
> @@ -311,9 +294,8 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
>                struct ib_device *device = mr->ibmr.device;
>                int size = mr->max_pages * sizeof(u64);
> 
> -               dma_unmap_single(device->dma_device, mr->page_map,
> -                                size, DMA_TO_DEVICE);
> -               kfree(mr->pages_alloc);
> +               dma_free_coherent(device->dma_device, size,
> +                               mr->pages, mr->page_map);
>                mr->pages = NULL;
>        }
> }
> @@ -532,19 +514,8 @@ int mlx4_ib_map_mr_sg(struct ib_mr *ibmr, struct scatterlist *sg, int sg_nents,
>                unsigned int sg_offset)
> {
>        struct mlx4_ib_mr *mr = to_mmr(ibmr);
> -       int rc;
> 
>        mr->npages = 0;
> 
> -       ib_dma_sync_single_for_cpu(ibmr->device, mr->page_map,
> -                                  sizeof(u64) * mr->max_pages,
> -                                  DMA_TO_DEVICE);
> -
> -       rc = ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
> -
> -       ib_dma_sync_single_for_device(ibmr->device, mr->page_map,
> -                                     sizeof(u64) * mr->max_pages,
> -                                     DMA_TO_DEVICE);
> -
> -       return rc;
> +       return ib_sg_to_pages(ibmr, sg, sg_nents, sg_offset, mlx4_set_page);
> }
> --
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
Chuck Lever



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                         ` <57473025.5020801-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2016-05-26 17:57                                           ` Chuck Lever
@ 2016-05-26 19:23                                           ` Leon Romanovsky
       [not found]                                             ` <20160526192351.GV25500-2ukJVAZIZ/Y@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Leon Romanovsky @ 2016-05-26 19:23 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Chuck Lever, Bart Van Assche, Yishai Hadas, Yishai Hadas,
	linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 3031 bytes --]

On Thu, May 26, 2016 at 08:19:33PM +0300, Sagi Grimberg wrote:
> 
> >>>>>When debugging is disabled, kzalloc returns page-aligned
> >>>>>addresses:
> >>>>
> >>>>Is it defined somewhere that regular kzalloc/kmalloc guarantees to
> >>>>return a page-aligned address as you see in your testing? If so, the
> >>>>debug mode should behave the same. Otherwise we can consider using an
> >>>>allocation flag that can force that, if such a flag exists.
> >>>>Let's get other people's input here.
> >>>
> >>>My understanding is that the fact that k[mz]alloc() returns a
> >>>page-aligned buffer if the allocation size is > PAGE_SIZE / 2 is a
> >>>side effect of the implementation and not something callers of that
> >>>function should rely on. I think the only assumption k[mz]alloc()
> >>>callers should rely on is that the allocated memory respects
> >>>ARCH_KMALLOC_MINALIGN.
> >>
> >>I agree. mlx4_alloc_priv_pages() is carefully designed to
> >>correct the alignment of the buffer, so it already assumes
> >>that it is not getting a page-aligned buffer.
> >>
> >>The alignment isn't the problem here, though. It's that
> >>the buffer contains a page-boundary. That is guaranteed
> >>to be the case for HCAs that support more than 512
> >>sges, so that will have to be addressed (at least in
> >>mlx5).
> >
> >rrr...
> >
> >I think we should make the pages allocations dma coherent
> >in order to fix that...
> >
> >Nice catch, Chuck.
> 
> Does this untested patch help (if so, mlx5 will need an identical patch)?
> 
> --
> diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> index ba328177eae9..78e9b3addfea 100644
> --- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
> +++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
> @@ -139,7 +139,6 @@ struct mlx4_ib_mr {
>         u32                     max_pages;
>         struct mlx4_mr          mmr;
>         struct ib_umem         *umem;
> -       void                    *pages_alloc;
>  };
> 
>  struct mlx4_ib_mw {
> diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
> index b04f6238e7e2..becb4a65c755 100644
> --- a/drivers/infiniband/hw/mlx4/mr.c
> +++ b/drivers/infiniband/hw/mlx4/mr.c
> @@ -278,30 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
>                       int max_pages)
>  {
>         int size = max_pages * sizeof(u64);
> -       int add_size;
> -       int ret;
> -
> -       add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
> 
> -       mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
> -       if (!mr->pages_alloc)
> +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> +                               &mr->page_map, GFP_KERNEL);

Sagi,
I'm wondering if allocation from ZONE_DMA is the right way to
replace ZONE_NORMAL allocations.

I don't remember if memory compaction works on ZONE_DMA, and it makes
me nervous to think about long-running, repeated MR alloc/dealloc scenarios.


[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                             ` <aaa67d51-663a-0aba-fc54-a5ab5d947a55-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
  2016-05-26 16:34                               ` Chuck Lever
@ 2016-05-26 20:10                               ` Christoph Lameter
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2016-05-26 20:10 UTC (permalink / raw)
  To: Bart Van Assche
  Cc: Yishai Hadas, Chuck Lever, Yishai Hadas, linux-rdma, Or Gerlitz,
	Joonsoo Kim, Haggai Eran, Majd Dibbiny

On Thu, 26 May 2016, Bart Van Assche wrote:

> On 05/26/2016 09:24 AM, Yishai Hadas wrote:
> > On 5/25/2016 6:58 PM, Chuck Lever wrote:
> > > When debugging is disabled, kzalloc returns page-aligned
> > > addresses:
> >
> > Is it defined somewhere that regular kzalloc/kmalloc guarantees to
> > return a page-aligned address as you see in your testing? If so, the
> > debug mode should behave the same. Otherwise we can consider using an
> > allocation flag that can force that, if such a flag exists.
> > Let's get other people's input here.
>
> My understanding is that the fact that k[mz]alloc() returns a page-aligned
> buffer if the allocation size is > PAGE_SIZE / 2 is a side effect of the
> implementation and not something callers of that function should rely on. I
> think the only assumption k[mz]alloc() callers should rely on is that the
> allocated memory respects ARCH_KMALLOC_MINALIGN.

The alignment of slab objects is specified at slab creation with
kmem_cache_create(). For kmalloc these are aligned to cache line
boundaries unless the cache object size is less than that.

Allocators may happen to align objects arbitrarily in the absence of an
alignment specification. SLUB happens to align objects whose size is a
multiple of the page size to page boundaries. But that only works if no
extra storage is required (e.g. for debugging features). It cannot be
relied upon.
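
Just to illustrate the point (hypothetical example, not a proposal for
mlx4): a caller that really needs a particular alignment can create its
own cache and request it explicitly:

#include <linux/slab.h>

static struct kmem_cache *mr_pages_cache;	/* made-up name */

static int mr_pages_cache_init(void)
{
	/* ask for page alignment explicitly; then an object of up to
	 * PAGE_SIZE bytes never straddles a page boundary, with or
	 * without debug padding (as I understand it) */
	mr_pages_cache = kmem_cache_create("mr_pages_demo",
					   511 * sizeof(u64),
					   PAGE_SIZE, 0, NULL);
	return mr_pages_cache ? 0 : -ENOMEM;
}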

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                             ` <20160526192351.GV25500-2ukJVAZIZ/Y@public.gmane.org>
@ 2016-05-26 20:12                                               ` Christoph Lameter
       [not found]                                                 ` <alpine.DEB.2.20.1605261511230.8857-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
  2016-05-29  7:10                                               ` Christoph Hellwig
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Lameter @ 2016-05-26 20:12 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Sagi Grimberg, Chuck Lever, Bart Van Assche, Yishai Hadas,
	Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny

On Thu, 26 May 2016, Leon Romanovsky wrote:

> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.
>
> I don't remember if memory compaction works on ZONE_DMA, and it makes
> me nervous to think about long-running, repeated MR alloc/dealloc scenarios.

ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
(such as floppy drivers). Please do not use it.

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                 ` <alpine.DEB.2.20.1605261511230.8857-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
@ 2016-05-29  7:02                                                   ` Sagi Grimberg
       [not found]                                                     ` <574A941D.9050404-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2016-05-29  7:02 UTC (permalink / raw)
  To: Christoph Lameter, Leon Romanovsky
  Cc: Chuck Lever, Bart Van Assche, Yishai Hadas, Yishai Hadas,
	linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran, Majd Dibbiny


>> Sagi,
>> I'm wondering if allocation from ZONE_DMA is the right way to
>> replace ZONE_NORMAL allocations.
>>
>> I don't remember if memory compaction works on ZONE_DMA, and it makes
>> me nervous to think about long-running, repeated MR alloc/dealloc scenarios.
>
> ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> (such as floppy drivers). Please do not use it.

Really? It's used all over the stack...

So the suggestion is to restrict the allocation so it does not
cross a page boundary (which can easily be done by making
MLX4_MR_PAGES_ALIGN be PAGE_SIZE or larger than PAGE_SIZE/2)?
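
i.e. in mlx4_alloc_priv_pages(), roughly (untested sketch, keeping the
existing kzalloc + PTR_ALIGN scheme and only changing the alignment):

	int size = max_pages * sizeof(u64);
	int add_size = max_t(int, PAGE_SIZE - ARCH_KMALLOC_MINALIGN, 0);

	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
	if (!mr->pages_alloc)
		return -ENOMEM;

	/* page aligned: an array of up to PAGE_SIZE bytes (512 entries)
	 * can never cross a page boundary */
	mr->pages = PTR_ALIGN(mr->pages_alloc, PAGE_SIZE);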
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                             ` <20160526192351.GV25500-2ukJVAZIZ/Y@public.gmane.org>
  2016-05-26 20:12                                               ` Christoph Lameter
@ 2016-05-29  7:10                                               ` Christoph Hellwig
       [not found]                                                 ` <20160529071040.GA24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2016-05-29  7:10 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Sagi Grimberg, Chuck Lever, Bart Van Assche, Yishai Hadas,
	Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny

On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> > +                               &mr->page_map, GFP_KERNEL);
> 
> Sagi,
> I'm wondering if allocation from ZONE_DMA is the right way to
> replace ZONE_NORMAL allocations.

Where do you see a ZONE_DMA allocation?
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                     ` <574A941D.9050404-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-05-29  7:17                                                       ` Christoph Hellwig
       [not found]                                                         ` <20160529071749.GB24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  2016-05-31 15:14                                                       ` Christoph Lameter
  1 sibling, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2016-05-29  7:17 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Lameter, Leon Romanovsky, Chuck Lever, Bart Van Assche,
	Yishai Hadas, Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim,
	Haggai Eran, Majd Dibbiny

On Sun, May 29, 2016 at 10:02:53AM +0300, Sagi Grimberg wrote:
> >ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> >(such as floppy drivers). Please do not use it.
> 
> Really? It's used all over the stack...
> 
> So the suggestion is to restrict the allocation so it does not
> cross a page boundary (which can easily be done by making
> MLX4_MR_PAGES_ALIGN be PAGE_SIZE or larger than PAGE_SIZE/2)?

We could simply switch to alloc_pages, e.g. something like the patch below:

But I have to admit I really like the coherent dma mapping version,
for all sane architectures that works much better, although it sucks
for some architectures where dma coherent mappings cause a bit of
overhead.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                         ` <20160529071749.GB24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2016-05-29  8:13                                                           ` Sagi Grimberg
       [not found]                                                             ` <574AA4BE.2060207-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Sagi Grimberg @ 2016-05-29  8:13 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Lameter, Leon Romanovsky, Chuck Lever, Bart Van Assche,
	Yishai Hadas, Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim,
	Haggai Eran, Majd Dibbiny


>> Really? It's used all over the stack...
>>
>> So the suggestion is to restrict the allocation so it does not
>> cross a page boundary (which can easily be done by making
>> MLX4_MR_PAGES_ALIGN be PAGE_SIZE or larger than PAGE_SIZE/2)?
>
> We could simply switch to alloc_pages, e.g. something like the patch below:

Patch below? :)

> But I have to admit I really like the coherent dma mapping version,
> for all sane architectures that works much better, although it sucks
> for some architectures where dma coherent mappings cause a bit of
> overhead.

But this is specific to mlx4, which supports up to 511 pages per MR;
mlx5 will need a coherent allocation anyway, right (it supports up to
64K pages per MR)?
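
For scale (assuming 4KB pages):

	mlx4: 511 entries * 8 bytes = 4088 bytes, fits in one page
	mlx5: 64K entries * 8 bytes = 512KB, i.e. 128 pages for the
	      array itself

so for mlx5 the page-list array will cross page boundaries no matter
how it is aligned.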
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                             ` <574AA4BE.2060207-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
@ 2016-05-29  8:15                                                               ` Christoph Hellwig
       [not found]                                                                 ` <20160529081527.GA5839-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
  0 siblings, 1 reply; 34+ messages in thread
From: Christoph Hellwig @ 2016-05-29  8:15 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Christoph Hellwig, Christoph Lameter, Leon Romanovsky,
	Chuck Lever, Bart Van Assche, Yishai Hadas, Yishai Hadas,
	linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran, Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 850 bytes --]

On Sun, May 29, 2016 at 11:13:50AM +0300, Sagi Grimberg wrote:
> 
> >>Really? It's used all over the stack...
> >>
> >>So the suggestion is to restrict the allocation so it does not
> >>cross a page boundary (which can easily be done by making
> >>MLX4_MR_PAGES_ALIGN be PAGE_SIZE or larger than PAGE_SIZE/2)?
> >
> >We could simply switch to alloc_pages, e.g. something like the patch below:
> 
> Patch below? :)

Here we go.

> 
> >But I have to admit I really like the coherent dma mapping version,
> >for all sane architectures that works much better, although it sucks
> >for some architectures where dma coherent mappings cause a bit of
> >overhead.
> 
> But this is specific to mlx4, which supports up to 511 pages per MR;
> mlx5 will need a coherent allocation anyway, right (it supports up to
> 64K pages per MR)?

Why does that make a difference?

[-- Attachment #2: mlx4-pages.diff --]
[-- Type: text/plain, Size: 1796 bytes --]

diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index ba32817..7adfa4b 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -129,8 +129,6 @@ struct mlx4_ib_cq {
 	struct list_head		recv_qp_list;
 };
 
-#define MLX4_MR_PAGES_ALIGN 0x40
-
 struct mlx4_ib_mr {
 	struct ib_mr		ibmr;
 	__be64			*pages;
@@ -139,7 +137,6 @@ struct mlx4_ib_mr {
 	u32			max_pages;
 	struct mlx4_mr		mmr;
 	struct ib_umem	       *umem;
-	void			*pages_alloc;
 };
 
 struct mlx4_ib_mw {
diff --git a/drivers/infiniband/hw/mlx4/mr.c b/drivers/infiniband/hw/mlx4/mr.c
index b04f623..3c6a21f 100644
--- a/drivers/infiniband/hw/mlx4/mr.c
+++ b/drivers/infiniband/hw/mlx4/mr.c
@@ -278,17 +278,13 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 		      int max_pages)
 {
 	int size = max_pages * sizeof(u64);
-	int add_size;
 	int ret;
 
-	add_size = max_t(int, MLX4_MR_PAGES_ALIGN - ARCH_KMALLOC_MINALIGN, 0);
-
-	mr->pages_alloc = kzalloc(size + add_size, GFP_KERNEL);
-	if (!mr->pages_alloc)
+	mr->pages = (__be64 *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
+			get_order(size));
+	if (!mr->pages)
 		return -ENOMEM;
-
-	mr->pages = PTR_ALIGN(mr->pages_alloc, MLX4_MR_PAGES_ALIGN);
-
+	
 	mr->page_map = dma_map_single(device->dma_device, mr->pages,
 				      size, DMA_TO_DEVICE);
 
@@ -299,7 +295,7 @@ mlx4_alloc_priv_pages(struct ib_device *device,
 
 	return 0;
 err:
-	kfree(mr->pages_alloc);
+	free_pages((unsigned long)mr->pages, get_order(size));
 
 	return ret;
 }
@@ -313,7 +309,7 @@ mlx4_free_priv_pages(struct mlx4_ib_mr *mr)
 
 		dma_unmap_single(device->dma_device, mr->page_map,
 				 size, DMA_TO_DEVICE);
-		kfree(mr->pages_alloc);
+		free_pages((unsigned long)mr->pages, get_order(size));
 		mr->pages = NULL;
 	}
 }

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                                 ` <20160529081527.GA5839-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2016-05-29  8:37                                                                   ` Sagi Grimberg
  0 siblings, 0 replies; 34+ messages in thread
From: Sagi Grimberg @ 2016-05-29  8:37 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Lameter, Leon Romanovsky, Chuck Lever, Bart Van Assche,
	Yishai Hadas, Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim,
	Haggai Eran, Majd Dibbiny


>> But this is specific to mlx4, which supports up to 511 pages per MR;
>> mlx5 will need a coherent allocation anyway, right (it supports up to
>> 64K pages per MR)?
>
> Why does that make a difference?

Chuck's original bug report said that the problem with the current
code is that dma_map_single expects the pages array to fit in a single
page:

"
See what mlx4_alloc_priv_pages does with this memory
allocation:

   mr->page_map = dma_map_single(device->dma_device, mr->pages,
                                 size, DMA_TO_DEVICE);

dma_map_single() expects the mr->pages allocation to fit
on a single page, as far as I can tell.
"

But when I revisited dma_map_single in some archs I didn't
see this expectation. So actually now I don't really see what
the problem is in the first place (or how your patch fixes it)...
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                 ` <20160529071040.GA24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
@ 2016-05-29  8:56                                                   ` Leon Romanovsky
  0 siblings, 0 replies; 34+ messages in thread
From: Leon Romanovsky @ 2016-05-29  8:56 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Sagi Grimberg, Chuck Lever, Bart Van Assche, Yishai Hadas,
	Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny

[-- Attachment #1: Type: text/plain, Size: 657 bytes --]

On Sun, May 29, 2016 at 12:10:40AM -0700, Christoph Hellwig wrote:
> On Thu, May 26, 2016 at 10:23:51PM +0300, Leon Romanovsky wrote:
> > > +       mr->pages = dma_alloc_coherent(device->dma_device, size,
> > > +                               &mr->page_map, GFP_KERNEL);
> > 
> > Sagi,
> > I'm wondering if allocation from ZONE_DMA is the right way to
> > replace ZONE_NORMAL allocations.
> 
> Where do you see a ZONE_DMA allocation?

Thanks for pointing that out; it caused me to reread the code again more
carefully, and it looks like I had mistakenly read a GFP_DMA flag into the
proposed code, which is definitely not the case.

Sorry for the noise.

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: RDMA Read: Local protection error
       [not found]                                                     ` <574A941D.9050404-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
  2016-05-29  7:17                                                       ` Christoph Hellwig
@ 2016-05-31 15:14                                                       ` Christoph Lameter
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Lameter @ 2016-05-31 15:14 UTC (permalink / raw)
  To: Sagi Grimberg
  Cc: Leon Romanovsky, Chuck Lever, Bart Van Assche, Yishai Hadas,
	Yishai Hadas, linux-rdma, Or Gerlitz, Joonsoo Kim, Haggai Eran,
	Majd Dibbiny

On Sun, 29 May 2016, Sagi Grimberg wrote:

>
> > > Sagi,
> > > I'm wondering if allocation from ZONE_DMA is the right way to
> > > replace ZONE_NORMAL allocations.
> > >
> > > I don't remember if memory compaction works on ZONE_DMA, and it makes
> > > me nervous to think about long-running, repeated MR alloc/dealloc scenarios.
> >
> > ZONE_DMA is a 16M memory segment used for legacy devices of the PC/AT era
> > (such as floppy drivers). Please do not use it.
>
> Really? It's used all over the stack...

On x86 that is the case. Other arches may have different uses for ZONE_DMA,
but the intent is to support legacy devices unable to write to the whole
of memory (that is usually ZONE_NORMAL).

from include/linux/mmzone.h

enum zone_type {
#ifdef CONFIG_ZONE_DMA
        /*
         * ZONE_DMA is used when there are devices that are not able
         * to do DMA to all of addressable memory (ZONE_NORMAL). Then we
         * carve out the portion of memory that is needed for these devices.
         * The range is arch specific.
         *
         * Some examples
         *
         * Architecture         Limit
         * ---------------------------
         * parisc, ia64, sparc  <4G
         * s390                 <2G
         * arm                  Various
         * alpha                Unlimited or 0-16MB.
         *
         * i386, x86_64 and multiple other arches
         *                      <16M.
         */
        ZONE_DMA,
#endif

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2016-05-31 15:14 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-04-29 16:24 RDMA Read: Local protection error Chuck Lever
     [not found] ` <1A4F4C32-CE5A-44D9-9BFE-0E1F8D5DF44D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-04-29 16:44   ` Santosh Shilimkar
     [not found]     ` <3fb4e75f-ff14-34e2-b6d3-6b6046812845-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-04-29 16:58       ` Chuck Lever
     [not found]         ` <72E8335B-282B-4DCC-AE4F-FE7E50ED5A08-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-04-29 19:07           ` Santosh Shilimkar
2016-04-29 16:45   ` Bart Van Assche
     [not found]     ` <57238F8C.70505-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-04-29 17:02       ` Chuck Lever
2016-04-29 17:34       ` Laurence Oberman
2016-05-02 15:10       ` Chuck Lever
     [not found]         ` <B72A389F-FFF1-498C-A946-8AA72B7769F8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-02 16:08           ` Bart Van Assche
     [not found]             ` <57277B63.8030506-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-05-03 14:57               ` Chuck Lever
     [not found]                 ` <6BBFD126-877C-4638-BB91-ABF715E29326-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-04  1:07                   ` Joonsoo Kim
2016-05-04 19:59                     ` Chuck Lever
     [not found]                       ` <F6C79393-6174-49B3-ADBB-E40627DEE85D-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-09  1:03                         ` Joonsoo Kim
     [not found]                           ` <CAAmzW4NbY3Og0BgQyeA4LLXTnMuPTjxVUdFbH+HLahBw+MAhsw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-05-09  1:15                             ` Chuck Lever
     [not found]                               ` <1A79DEDE-A5C3-4581-A0AE-7C0AB056B4C7-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-09  2:11                                 ` Joonsoo Kim
2016-05-25 15:58                   ` Chuck Lever
     [not found]                     ` <1AFD636B-09FC-4736-B1C5-D1D9FA0B97B0-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-26 16:24                       ` Yishai Hadas
     [not found]                         ` <8a3276bf-f716-3dca-9d54-369fc3bdcc39-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2016-05-26 16:30                           ` Bart Van Assche
     [not found]                             ` <aaa67d51-663a-0aba-fc54-a5ab5d947a55-XdAiOPVOjttBDgjK7y7TUQ@public.gmane.org>
2016-05-26 16:34                               ` Chuck Lever
     [not found]                                 ` <C0AE237D-5E5A-4F94-B717-F3A3B4B4D4A8-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2016-05-26 16:48                                   ` Sagi Grimberg
     [not found]                                     ` <574728EC.9040802-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-05-26 17:19                                       ` Sagi Grimberg
     [not found]                                         ` <57473025.5020801-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-05-26 17:57                                           ` Chuck Lever
2016-05-26 19:23                                           ` Leon Romanovsky
     [not found]                                             ` <20160526192351.GV25500-2ukJVAZIZ/Y@public.gmane.org>
2016-05-26 20:12                                               ` Christoph Lameter
     [not found]                                                 ` <alpine.DEB.2.20.1605261511230.8857-wcBtFHqTun5QOdAKl3ChDw@public.gmane.org>
2016-05-29  7:02                                                   ` Sagi Grimberg
     [not found]                                                     ` <574A941D.9050404-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-05-29  7:17                                                       ` Christoph Hellwig
     [not found]                                                         ` <20160529071749.GB24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-05-29  8:13                                                           ` Sagi Grimberg
     [not found]                                                             ` <574AA4BE.2060207-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org>
2016-05-29  8:15                                                               ` Christoph Hellwig
     [not found]                                                                 ` <20160529081527.GA5839-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-05-29  8:37                                                                   ` Sagi Grimberg
2016-05-31 15:14                                                       ` Christoph Lameter
2016-05-29  7:10                                               ` Christoph Hellwig
     [not found]                                                 ` <20160529071040.GA24347-wEGCiKHe2LqWVfeAwA7xHQ@public.gmane.org>
2016-05-29  8:56                                                   ` Leon Romanovsky
2016-05-26 20:10                               ` Christoph Lameter
2016-05-26 16:39                           ` Leon Romanovsky
