linux-rdma.vger.kernel.org archive mirror
* v5.14 RXE driver broken?
@ 2021-08-24  3:01 Bart Van Assche
  2021-08-25  3:02 ` Zhu Yanjun
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-24  3:01 UTC (permalink / raw)
  To: Bob Pearson; +Cc: linux-rdma, linux-block

Hi Bob,

If I run the following test against Linus' master branch then that test
passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
headers to staging"")):

# export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
    runtime    ...  48.849s

The following test fails:

# export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
    runtime  48.849s  ...  15.024s
    --- tests/srp/002.out       2018-09-08 19:43:42.291664821 -0700
    +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
    @@ -1,2 +1 @@
     Configured SRP target driver
    -Passed

The only difference between these two tests is that test (1) uses the siw
(soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
Both tests run reliably against previous Linux kernel versions, e.g.
v5.13. Can you take a look at this? The blktests software is available at
https://github.com/osandov/blktests/.

Thanks,

Bart.


* Re: v5.14 RXE driver broken?
  2021-08-24  3:01 v5.14 RXE driver broken? Bart Van Assche
@ 2021-08-25  3:02 ` Zhu Yanjun
  2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 16:46   ` Bart Van Assche
  0 siblings, 2 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-25  3:02 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: Bob Pearson, linux-rdma, linux-block

On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Hi Bob,
>
> If I run the following test against Linus' master branch then that test
> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> headers to staging"")):
>
> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>     runtime    ...  48.849s
>
> The following test fails:
>
> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>     runtime  48.849s  ...  15.024s
>     --- tests/srp/002.out       2018-09-08 19:43:42.291664821 -0700
>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>     @@ -1,2 +1 @@
>      Configured SRP target driver
>     -Passed

Does the commit "RDMA/rxe: Zero out index member of struct rxe_queue"
(https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc)
fix this problem?

That commit will be merged into upstream Linux very soon.

Zhu Yanjun

>
> The only difference between these two tests is that test (1) use the siw
> (soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
> Both tests run reliably against previous Linux kernel versions, e.g.
> v5.13. Can you take a look at this? The blktests software is available at
> https://github.com/osandov/blktests/.
>
> Thanks,
>
> Bart.


* Re: v5.14 RXE driver broken?
  2021-08-25  3:02 ` Zhu Yanjun
@ 2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 18:03     ` Bob Pearson
                       ` (2 more replies)
  2021-08-25 16:46   ` Bart Van Assche
  1 sibling, 3 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2021-08-25 16:32 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Bart Van Assche, Bob Pearson, linux-rdma, linux-block

On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > Hi Bob,
> >
> > If I run the following test against Linus' master branch then that test
> > passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> > headers to staging"")):
> >
> > # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> >     runtime    ...  48.849s
> >
> > The following test fails:
> >
> > # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> >     runtime  48.849s  ...  15.024s
> >     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
> >     @@ -1,2 +1 @@
> >      Configured SRP target driver
> >     -Passed
> 
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
> 
> And the commit will be merged into linux upstream very soon.

Please let me know, Bart. If the rxe driver is still broken I will
definitely punt all the changes for RXE to the next cycle until it can
be fixed.

Jason


* Re: v5.14 RXE driver broken?
  2021-08-25  3:02 ` Zhu Yanjun
  2021-08-25 16:32   ` Jason Gunthorpe
@ 2021-08-25 16:46   ` Bart Van Assche
  1 sibling, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 16:46 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block

On 8/24/21 8:02 PM, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> Hi Bob,
>>
>> If I run the following test against Linus' master branch then that test
>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>> headers to staging"")):
>>
>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>      runtime    ...  48.849s
>>
>> The following test fails:
>>
>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>      runtime  48.849s  ...  15.024s
>>      --- tests/srp/002.out       2018-09-08 19:43:42.291664821 -0700
>>      +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>      @@ -1,2 +1 @@
>>       Configured SRP target driver
>>      -Passed
> 
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
> 
> And the commit will be merged into linux upstream very soon.

Hi Zhu,

Thanks for having taken a look.

Isn't commit cc4f596cf85e ("RDMA/rxe: Zero out index member of struct
rxe_queue") already in Linus' tree? I think it was merged yesterday (August
24). Unfortunately the test I mentioned still fails on top of that patch.

Thanks,

Bart.



* Re: v5.14 RXE driver broken?
  2021-08-25 16:32   ` Jason Gunthorpe
@ 2021-08-25 18:03     ` Bob Pearson
  2021-08-25 18:22     ` Bart Van Assche
  2021-08-26 19:03     ` Bob Pearson
  2 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 18:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>     runtime    ...  48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>     runtime  48.849s  ...  15.024s
>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>     @@ -1,2 +1 @@
>>>      Configured SRP target driver
>>>     -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
> 
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
> 
> Jason
> 
Jason,

I am (I think) able to reproduce Bart's issue. I wouldn't hold up the 'bug fix' patches; they are all legitimate.

Bob


* Re: v5.14 RXE driver broken?
  2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 18:03     ` Bob Pearson
@ 2021-08-25 18:22     ` Bart Van Assche
  2021-08-25 20:58       ` Bart Van Assche
  2021-08-26 19:03     ` Bob Pearson
  2 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 18:22 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block

On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>      runtime    ...  48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>      runtime  48.849s  ...  15.024s
>>>      +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>      @@ -1,2 +1 @@
>>>       Configured SRP target driver
>>>      -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
> 
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.

Hi Jason,

Thanks for having offered to revert the RXE changes from this merge window.
Unfortunately that wouldn't be sufficient. My test results so far for test
srp/002 in combination with the rdma_rxe driver are as follows:
* Kernel v5.12: test passes.
* Kernel v5.13: test fails.
* Kernel v5.14-rc7: test fails.

For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
log:

ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count

There is sufficient memory available in the VM in which I ran the tests. It is
not clear to me why ib_alloc_mr() fails with these parameters when using the
rdma_rxe driver. As one can see in srp_alloc_fr_pool(), the SRP initiator driver
respects the max_pages_per_mr RDMA driver limit.

Thanks,

Bart.
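
The numbers in the kernel log above can be cross-checked with a little arithmetic. The sketch below is only an illustration: the 512-byte sector size and the mr_per_cmd formula are assumptions chosen because they reproduce the logged values (max_sectors_per_mr = 4096, mr_per_cmd = 2), and the queue depth and channel count are hypothetical placeholders, since the thread does not state them.

```python
SECTOR_SIZE = 512  # bytes per sector (assumption)


def srp_mr_estimate(max_sectors, max_pages_per_mr, mr_page_size,
                    register_always=True):
    """Return (max_sectors_per_mr, mr_per_cmd) for the given limits.

    The mr_per_cmd formula is an assumption that matches the log line,
    not a quotation of the ib_srp source.
    """
    max_sectors_per_mr = max_pages_per_mr * mr_page_size // SECTOR_SIZE
    # ceil((max_sectors + 1) / max_sectors_per_mr), plus one extra MR
    # when every command is registered.
    mr_per_cmd = int(register_always) + (
        (max_sectors + max_sectors_per_mr) // max_sectors_per_mr
    )
    return max_sectors_per_mr, mr_per_cmd


def mrs_needed(mr_per_cmd, queue_depth, ch_count):
    """Total MRs if each channel keeps mr_per_cmd MRs per queued command."""
    return mr_per_cmd * queue_depth * ch_count


if __name__ == "__main__":
    per_mr, per_cmd = srp_mr_estimate(1024, 512, 4096)
    print(per_mr, per_cmd)
    # queue_depth=64 and ch_count=4 are hypothetical example values.
    print(mrs_needed(per_cmd, queue_depth=64, ch_count=4))
```

Comparing the result of mrs_needed() for a real configuration against the driver's MR limit is one way to judge whether the error message's advice (reduce max_cmd_per_lun, max_sect or ch_count) could apply.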


* Re: v5.14 RXE driver broken?
  2021-08-25 18:22     ` Bart Van Assche
@ 2021-08-25 20:58       ` Bart Van Assche
  2021-08-25 21:09         ` Bob Pearson
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 20:58 UTC (permalink / raw)
  To: Jason Gunthorpe; +Cc: Bob Pearson, Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 11:22 AM, Bart Van Assche wrote:
> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>      runtime    ...  48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>      runtime  48.849s  ...  15.024s
>>>>      +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>>      @@ -1,2 +1 @@
>>>>       Configured SRP target driver
>>>>      -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
> 
> Hi Jason,
> 
> Thanks for having offered to revert the RXE changes from this merge window.
> Unfortunately that wouldn't be sufficient. My test results so far for test
> srp/002 in combination with the rdma_rxe driver are as follows:
> * Kernel v5.12: test passes.
> * Kernel v5.13: test fails.
> * Kernel v5.14-rc7: test fails.
> 
> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
> log:
> 
> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
> 
> There is sufficient memory available in the VM in which I ran the tests. It is
> not clear to me why ib_alloc_mr() fails with these parameters when using the
> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
> respects the max_pages_per_mr RDMA driver limit.

A correction: test srp/002 passes on my setup against kernel v5.13. I probably
selected the wrong kernel from the GRUB boot menu before I sent my previous email.
So the test failure is something that happens with v5.14-rc but not with v5.13.

Applying the following patch on top of Linus' master branch did not help:

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 742e6ec93686..643b80e47c82 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -88,7 +88,7 @@ enum rxe_device_param {
  	RXE_MIN_SRQ_INDEX		= 0x00020001,
  	RXE_MAX_SRQ_INDEX		= 0x00040000,

-	RXE_MAX_MR			= 0x00001000,
+	RXE_MAX_MR			= 0x00100000,
  	RXE_MAX_MW			= 0x00001000,
  	RXE_MIN_MR_INDEX		= 0x00000001,
  	RXE_MAX_MR_INDEX		= 0x00010000,

Bart.
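
For reference, the constants in the diff above work out as follows. This is plain arithmetic on the values shown in the patch context; whether the driver cross-checks RXE_MAX_MR against the MR index range is not stated in this thread, so the comparison at the end is only something one might want to look at.

```python
# Values copied from the rxe_param.h diff above.
RXE_MAX_MR_OLD = 0x00001000
RXE_MAX_MR_NEW = 0x00100000
RXE_MIN_MR_INDEX = 0x00000001
RXE_MAX_MR_INDEX = 0x00010000

# How much the patch raises the MR limit.
growth = RXE_MAX_MR_NEW // RXE_MAX_MR_OLD

# Number of distinct MR indices available in [MIN, MAX].
index_span = RXE_MAX_MR_INDEX - RXE_MIN_MR_INDEX + 1

if __name__ == "__main__":
    print(RXE_MAX_MR_OLD, RXE_MAX_MR_NEW, growth)
    print(index_span, RXE_MAX_MR_NEW > index_span)
```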


* Re: v5.14 RXE driver broken?
  2021-08-25 20:58       ` Bart Van Assche
@ 2021-08-25 21:09         ` Bob Pearson
  2021-08-25 21:44           ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 21:09 UTC (permalink / raw)
  To: Bart Van Assche, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 3:58 PM, Bart Van Assche wrote:
> On 8/25/21 11:22 AM, Bart Van Assche wrote:
>> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> Hi Bob,
>>>>>
>>>>> If I run the following test against Linus' master branch then that test
>>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>>> headers to staging"")):
>>>>>
>>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>>      runtime    ...  48.849s
>>>>>
>>>>> The following test fails:
>>>>>
>>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>>      runtime  48.849s  ...  15.024s
>>>>>      +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>>>      @@ -1,2 +1 @@
>>>>>       Configured SRP target driver
>>>>>      -Passed
>>>>
>>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>>> fix this problem?
>>>>
>>>> And the commit will be merged into linux upstream very soon.
>>>
>>> Please let me know Bart, if the rxe driver is still broken I will
>>> definitely punt all the changes for RXE to the next cycle until it can
>>> be fixed.
>>
>> Hi Jason,
>>
>> Thanks for having offered to revert the RXE changes from this merge window.
>> Unfortunately that wouldn't be sufficient. My test results so far for test
>> srp/002 in combination with the rdma_rxe driver are as follows:
>> * Kernel v5.12: test passes.
>> * Kernel v5.13: test fails.
>> * Kernel v5.14-rc7: test fails.
>>
>> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
>> log:
>>
>> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
>> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>>
>> There is sufficient memory available in the VM in which I ran the tests. It is
>> not clear to me why ib_alloc_mr() fails with these parameters when using the
>> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
>> respects the max_pages_per_mr RDMA driver limit.
> 
> A correction: test srp/002 passes on my setup against kernel v5.13. I probably
> selected the wrong kernel from the GRUB boot menu before I sent my previous email.
> So the test failure is something that happens with v5.14-rc but not with v5.13.
> 
> Applying the following patch on top Linus' master branch did not help:
> 
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index 742e6ec93686..643b80e47c82 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -88,7 +88,7 @@ enum rxe_device_param {
>      RXE_MIN_SRQ_INDEX        = 0x00020001,
>      RXE_MAX_SRQ_INDEX        = 0x00040000,
> 
> -    RXE_MAX_MR            = 0x00001000,
> +    RXE_MAX_MR            = 0x00100000,
>      RXE_MAX_MW            = 0x00001000,
>      RXE_MIN_MR_INDEX        = 0x00000001,
>      RXE_MAX_MR_INDEX        = 0x00010000,
> 
> Bart.
Bart,

Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a 5.13 thing.
I am still not seeing that error in my test setup. I am getting a soft lockup error after ~20 seconds.
During most of that time there is a constant exchange of req/ack packets with nothing else happening.

If you want I can send you a patch to print out error messages from MR allocation.

Bob


* Re: v5.14 RXE driver broken?
  2021-08-25 21:09         ` Bob Pearson
@ 2021-08-25 21:44           ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 21:44 UTC (permalink / raw)
  To: Bob Pearson, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 2:09 PM, Bob Pearson wrote:
> Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a
> 5.13 thing. I am still not seeing that error in my test setup. I am getting
> a soft lockup error after ~20 seconds. During most of that there is a
> constant exchange of req/ack packets with nothing else happening.
> 
> If you want I can send you a patch to print out error messages from MR
> allocation.

Hi Bob,

I see the ib_alloc_mr() failures with kernel v5.14 in two different VMs. A
different Linux distro has been installed in each VM.

If it would help your debugging efforts, please send me the patch that prints
out the MR allocation error messages.

Bart.


* Re: v5.14 RXE driver broken?
  2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 18:03     ` Bob Pearson
  2021-08-25 18:22     ` Bart Van Assche
@ 2021-08-26 19:03     ` Bob Pearson
  2021-08-26 20:03       ` Bob Pearson
  2021-08-27  3:18       ` Zhu Yanjun
  2 siblings, 2 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 19:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>     runtime    ...  48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>     runtime  48.849s  ...  15.024s
>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>     @@ -1,2 +1 @@
>>>      Configured SRP target driver
>>>     -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
> 
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
> 
> Jason
> 

Jason, Bart, Zhu

I have succeeded in getting blktests to pass on 5.14. There is a bug in rxe that I had to fix. In
loopback mode, when an RNR NAK is received, the requester is asked to start a retry sequence
before the RNR timer fires, so the command is retried immediately regardless of the value of
the timeout. I made a small change that requires the requester to wait for either the timer to
fire or an ack to arrive. In some cases the srp/002 test case in blktests spends a long time
before posting a receive, which caused a soft lockup. There is a second, non-bug issue: the
number of MRs was too small to run the test. I increased it by a factor of 256, which fixed that.

My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.

I will submit a patch for the rnr fix.

Bob
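
The effect of the RNR behaviour described above can be illustrated with a toy model. This is not rxe code: time is counted in abstract ticks, an immediate retry is assumed to cost one tick (a loopback round trip), and the function and parameter names are made up for the illustration.

```python
def count_retries(wait_for_timer: bool, ready_after: int, rnr_timeout: int) -> int:
    """Count retransmissions until the responder has a receive posted.

    wait_for_timer=False models retrying right after each RNR NAK;
    wait_for_timer=True models waiting for the RNR timer before retrying.
    The responder posts its receive at tick `ready_after`.
    """
    now = 0
    retries = 0
    # Any request sent before `ready_after` is answered with an RNR NAK.
    while now < ready_after:
        now += rnr_timeout if wait_for_timer else 1
        retries += 1
    return retries


if __name__ == "__main__":
    # Responder needs 100 ticks to post a receive; RNR timeout is 10 ticks.
    print(count_retries(False, ready_after=100, rnr_timeout=10))
    print(count_retries(True, ready_after=100, rnr_timeout=10))
```

In this model the immediate-retry variant retransmits once per tick for as long as the responder is not ready, which is the kind of busy exchange that could starve other work, while honouring the timer cuts the retry count by the timeout factor.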



* Re: v5.14 RXE driver broken?
  2021-08-26 19:03     ` Bob Pearson
@ 2021-08-26 20:03       ` Bob Pearson
  2021-08-27  3:18       ` Zhu Yanjun
  1 sibling, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 20:03 UTC (permalink / raw)
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/26/21 2:03 PM, Bob Pearson wrote:
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>     runtime    ...  48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>     runtime  48.849s  ...  15.024s
>>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
>>>>     @@ -1,2 +1 @@
>>>>      Configured SRP target driver
>>>>     -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
>>
>> Jason
>>
> 
> Jason, Bart, Zhu
> 
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix. In
> loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting
> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
> 
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
> 
> I will submit a patch for the rnr fix.
> 
> Bob
> 

Well it's better but not quite done yet.


* Re: v5.14 RXE driver broken?
  2021-08-26 19:03     ` Bob Pearson
  2021-08-26 20:03       ` Bob Pearson
@ 2021-08-27  3:18       ` Zhu Yanjun
  1 sibling, 0 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-27  3:18 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Jason Gunthorpe, Bart Van Assche, linux-rdma, linux-block

On Fri, Aug 27, 2021 at 3:03 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> >> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>
> >>> Hi Bob,
> >>>
> >>> If I run the following test against Linus' master branch then that test
> >>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> >>> headers to staging"")):
> >>>
> >>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> >>>     runtime    ...  48.849s
> >>>
> >>> The following test fails:
> >>>
> >>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> >>>     runtime  48.849s  ...  15.024s
> >>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad      2021-08-23 19:51:05.182958728 -0700
> >>>     @@ -1,2 +1 @@
> >>>      Configured SRP target driver
> >>>     -Passed
> >>
> >> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> >> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> >> fix this problem?
> >>
> >> And the commit will be merged into linux upstream very soon.
> >
> > Please let me know Bart, if the rxe driver is still broken I will
> > definitely punt all the changes for RXE to the next cycle until it can
> > be fixed.
> >
> > Jason
> >
>
> Jason, Bart, Zhu
>
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix. In
> loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting

Can this problem be reproduced with 5.13? According to Bart, this problem
does not occur with v5.13.

Thanks
Zhu Yanjun

> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
>
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
>
> I will submit a patch for the rnr fix.
>
> Bob
>



Thread overview: 12+ messages
2021-08-24  3:01 v5.14 RXE driver broken? Bart Van Assche
2021-08-25  3:02 ` Zhu Yanjun
2021-08-25 16:32   ` Jason Gunthorpe
2021-08-25 18:03     ` Bob Pearson
2021-08-25 18:22     ` Bart Van Assche
2021-08-25 20:58       ` Bart Van Assche
2021-08-25 21:09         ` Bob Pearson
2021-08-25 21:44           ` Bart Van Assche
2021-08-26 19:03     ` Bob Pearson
2021-08-26 20:03       ` Bob Pearson
2021-08-27  3:18       ` Zhu Yanjun
2021-08-25 16:46   ` Bart Van Assche
