* v5.14 RXE driver broken?
@ 2021-08-24 3:01 Bart Van Assche
2021-08-25 3:02 ` Zhu Yanjun
0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-24 3:01 UTC (permalink / raw)
To: Bob Pearson; +Cc: linux-rdma, linux-block
Hi Bob,
If I run the following test against Linus' master branch then that test
passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
headers to staging"")):
# export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
runtime ... 48.849s
The following test fails:
# export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
runtime 48.849s ... 15.024s
--- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
+++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
@@ -1,2 +1 @@
Configured SRP target driver
-Passed
The only difference between these two tests is that test (1) uses the siw
(soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
Both tests run reliably against previous Linux kernel versions, e.g.
v5.13. Can you take a look at this? The blktests software is available at
https://github.com/osandov/blktests/.
Thanks,
Bart.
* Re: v5.14 RXE driver broken?
2021-08-24 3:01 v5.14 RXE driver broken? Bart Van Assche
@ 2021-08-25 3:02 ` Zhu Yanjun
2021-08-25 16:32 ` Jason Gunthorpe
2021-08-25 16:46 ` Bart Van Assche
0 siblings, 2 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-25 3:02 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Bob Pearson, linux-rdma, linux-block
On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Hi Bob,
>
> If I run the following test against Linus' master branch then that test
> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> headers to staging"")):
>
> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> runtime ... 48.849s
>
> The following test fails:
>
> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> runtime 48.849s ... 15.024s
> --- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
> @@ -1,2 +1 @@
> Configured SRP target driver
> -Passed
Does the commit "RDMA/rxe: Zero out index member of struct rxe_queue",
available at https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc,
fix this problem?
That commit will be merged into upstream Linux very soon.
Zhu Yanjun
>
> The only difference between these two tests is that test (1) uses the siw
> (soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
> Both tests run reliably against previous Linux kernel versions, e.g.
> v5.13. Can you take a look at this? The blktests software is available at
> https://github.com/osandov/blktests/.
>
> Thanks,
>
> Bart.
* Re: v5.14 RXE driver broken?
2021-08-25 3:02 ` Zhu Yanjun
@ 2021-08-25 16:32 ` Jason Gunthorpe
2021-08-25 18:03 ` Bob Pearson
` (2 more replies)
2021-08-25 16:46 ` Bart Van Assche
1 sibling, 3 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2021-08-25 16:32 UTC (permalink / raw)
To: Zhu Yanjun; +Cc: Bart Van Assche, Bob Pearson, linux-rdma, linux-block
On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > Hi Bob,
> >
> > If I run the following test against Linus' master branch then that test
> > passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> > headers to staging"")):
> >
> > # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> > runtime ... 48.849s
> >
> > The following test fails:
> >
> > # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> > runtime 48.849s ... 15.024s
> > +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
> > @@ -1,2 +1 @@
> > Configured SRP target driver
> > -Passed
>
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
>
> And the commit will be merged into linux upstream very soon.
Please let me know, Bart. If the rxe driver is still broken I will
definitely punt all the changes for RXE to the next cycle until it can
be fixed.
Jason
* Re: v5.14 RXE driver broken?
2021-08-25 3:02 ` Zhu Yanjun
2021-08-25 16:32 ` Jason Gunthorpe
@ 2021-08-25 16:46 ` Bart Van Assche
1 sibling, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 16:46 UTC (permalink / raw)
To: Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block
On 8/24/21 8:02 PM, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> Hi Bob,
>>
>> If I run the following test against Linus' master branch then that test
>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>> headers to staging"")):
>>
>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>> runtime ... 48.849s
>>
>> The following test fails:
>>
>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>> runtime 48.849s ... 15.024s
>> --- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>> @@ -1,2 +1 @@
>> Configured SRP target driver
>> -Passed
>
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
>
> And the commit will be merged into linux upstream very soon.
Hi Zhu,
Thanks for having taken a look.
Isn't commit cc4f596cf85e ("RDMA/rxe: Zero out index member of struct
rxe_queue") already in Linus' tree? I think it was merged yesterday (August
24). Unfortunately the test I mentioned still fails on top of that patch.
Thanks,
Bart.
* Re: v5.14 RXE driver broken?
2021-08-25 16:32 ` Jason Gunthorpe
@ 2021-08-25 18:03 ` Bob Pearson
2021-08-25 18:22 ` Bart Van Assche
2021-08-26 19:03 ` Bob Pearson
2 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 18:03 UTC (permalink / raw)
To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block
On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>> runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>> runtime 48.849s ... 15.024s
>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>> @@ -1,2 +1 @@
>>> Configured SRP target driver
>>> -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
>
> Jason
>
Jason,
I am (I think) able to reproduce Bart's issue. I wouldn't hold up the 'bug fix' patches; they are all legitimate.
Bob
* Re: v5.14 RXE driver broken?
2021-08-25 16:32 ` Jason Gunthorpe
2021-08-25 18:03 ` Bob Pearson
@ 2021-08-25 18:22 ` Bart Van Assche
2021-08-25 20:58 ` Bart Van Assche
2021-08-26 19:03 ` Bob Pearson
2 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 18:22 UTC (permalink / raw)
To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block
On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>> runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>> runtime 48.849s ... 15.024s
>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>> @@ -1,2 +1 @@
>>> Configured SRP target driver
>>> -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
Hi Jason,
Thanks for having offered to revert the RXE changes from this merge window.
Unfortunately that wouldn't be sufficient. My test results so far for test
srp/002 in combination with the rdma_rxe driver are as follows:
* Kernel v5.12: test passes.
* Kernel v5.13: test fails.
* Kernel v5.14-rc7: test fails.
For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
log:
ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
There is sufficient memory available in the VM in which I ran the tests. It is
not clear to me why ib_alloc_mr() fails with these parameters when using the
rdma_rxe driver. As one can see in srp_alloc_fr_pool(), the SRP initiator driver
respects the max_pages_per_mr RDMA driver limit.
Thanks,
Bart.
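The figures in the ib_srp log line quoted above are mutually consistent, which a short calculation can confirm. The sketch below is Python rather than kernel C. The two formulas are an assumed reading of how the SRP initiator derives max_sectors_per_mr and mr_per_cmd (with register_always assumed to be enabled), not a quotation of the driver source, and the helper names are hypothetical.

```python
import re

# Kernel log line quoted in the message above.
LOG = ("ib_srp:add_target_store: ib_srp: max_sectors = 1024; "
       "max_pages_per_mr = 512; mr_page_size = 4096; "
       "max_sectors_per_mr = 4096; mr_per_cmd = 2")

SECTOR_SIZE = 512  # bytes per block-layer sector


def parse_params(line):
    """Return the 'name = value' pairs of the log line as a dict of ints."""
    return {name: int(value) for name, value in re.findall(r"(\w+) = (\d+)", line)}


def max_sectors_per_mr(max_pages_per_mr, mr_page_size):
    # Assumption: one MR maps max_pages_per_mr pages of mr_page_size bytes.
    return max_pages_per_mr * mr_page_size // SECTOR_SIZE


def mr_per_cmd(max_sectors, sectors_per_mr, register_always=True):
    # Assumption: ceil((max_sectors + 1) / sectors_per_mr) MRs carry the data,
    # and register_always adds one more MR per command.
    data_mrs = (max_sectors + 1 + sectors_per_mr - 1) // sectors_per_mr
    return int(register_always) + data_mrs


params = parse_params(LOG)
per_mr = max_sectors_per_mr(params["max_pages_per_mr"], params["mr_page_size"])
per_cmd = mr_per_cmd(params["max_sectors"], per_mr)
print(per_mr, per_cmd)
```

Under these assumptions the recomputed values agree with the logged max_sectors_per_mr and mr_per_cmd, so the parameters themselves do not look inconsistent; the failure would then have to come from somewhere else, such as a device-side limit on the number of MRs.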
* Re: v5.14 RXE driver broken?
2021-08-25 18:22 ` Bart Van Assche
@ 2021-08-25 20:58 ` Bart Van Assche
2021-08-25 21:09 ` Bob Pearson
0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 20:58 UTC (permalink / raw)
To: Jason Gunthorpe; +Cc: Bob Pearson, Zhu Yanjun, linux-rdma, linux-block
On 8/25/21 11:22 AM, Bart Van Assche wrote:
> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>> runtime ... 48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>> runtime 48.849s ... 15.024s
>>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>> @@ -1,2 +1 @@
>>>> Configured SRP target driver
>>>> -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
>
> Hi Jason,
>
> Thanks for having offered to revert the RXE changes from this merge window.
> Unfortunately that wouldn't be sufficient. My test results so far for test
> srp/002 in combination with the rdma_rxe driver are as follows:
> * Kernel v5.12: test passes.
> * Kernel v5.13: test fails.
> * Kernel v5.14-rc7: test fails.
>
> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
> log:
>
> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>
> There is sufficient memory available in the VM in which I ran the tests. It is
> not clear to me why ib_alloc_mr() fails with these parameters when using the
> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
> respects the max_pages_per_mr RDMA driver limit.
A correction: test srp/002 passes on my setup against kernel v5.13. I probably
selected the wrong kernel from the GRUB boot menu before I sent my previous email.
So the test failure is something that happens with v5.14-rc but not with v5.13.
Applying the following patch on top of Linus' master branch did not help:
diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 742e6ec93686..643b80e47c82 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -88,7 +88,7 @@ enum rxe_device_param {
RXE_MIN_SRQ_INDEX = 0x00020001,
RXE_MAX_SRQ_INDEX = 0x00040000,
- RXE_MAX_MR = 0x00001000,
+ RXE_MAX_MR = 0x00100000,
RXE_MAX_MW = 0x00001000,
RXE_MIN_MR_INDEX = 0x00000001,
RXE_MAX_MR_INDEX = 0x00010000,
Bart.
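For scale, the patch above raises RXE_MAX_MR from 0x1000 (4096) to 0x100000 (1048576). A rough way to reason about whether an initiator fits under such a cap is sketched below in Python. The function name and the queue depth, channel count and login count are hypothetical illustration values, not numbers from this thread; only mr_per_cmd = 2 and the two RXE_MAX_MR values come from the messages above. Treating RXE_MAX_MR as the cap on allocatable MRs is itself an assumption based on the constant's name.

```python
RXE_MAX_MR_OLD = 0x00001000  # value removed by the patch above
RXE_MAX_MR_NEW = 0x00100000  # value added by the patch above


def mrs_needed(queue_depth, mr_per_cmd, channels, logins):
    """Hypothetical estimate: every outstanding command on every channel
    of every login may hold mr_per_cmd memory regions at once."""
    return queue_depth * mr_per_cmd * channels * logins


# Illustrative workload only; queue_depth, channels and logins are assumptions.
demand = mrs_needed(queue_depth=64, mr_per_cmd=2, channels=8, logins=6)

growth = RXE_MAX_MR_NEW // RXE_MAX_MR_OLD  # how much the patch raises the cap
print(growth)
print(demand, demand <= RXE_MAX_MR_OLD, demand <= RXE_MAX_MR_NEW)
```

The estimate is a plain product, which presumably is why the ib_srp error message suggests reducing max_cmd_per_lun, max_sect or ch_count: each one shrinks a factor of that product.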
* Re: v5.14 RXE driver broken?
2021-08-25 20:58 ` Bart Van Assche
@ 2021-08-25 21:09 ` Bob Pearson
2021-08-25 21:44 ` Bart Van Assche
0 siblings, 1 reply; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 21:09 UTC (permalink / raw)
To: Bart Van Assche, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block
On 8/25/21 3:58 PM, Bart Van Assche wrote:
> On 8/25/21 11:22 AM, Bart Van Assche wrote:
>> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> Hi Bob,
>>>>>
>>>>> If I run the following test against Linus' master branch then that test
>>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>>> headers to staging"")):
>>>>>
>>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>> runtime ... 48.849s
>>>>>
>>>>> The following test fails:
>>>>>
>>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>> runtime 48.849s ... 15.024s
>>>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>>> @@ -1,2 +1 @@
>>>>> Configured SRP target driver
>>>>> -Passed
>>>>
>>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>>> fix this problem?
>>>>
>>>> And the commit will be merged into linux upstream very soon.
>>>
>>> Please let me know Bart, if the rxe driver is still broken I will
>>> definitely punt all the changes for RXE to the next cycle until it can
>>> be fixed.
>>
>> Hi Jason,
>>
>> Thanks for having offered to revert the RXE changes from this merge window.
>> Unfortunately that wouldn't be sufficient. My test results so far for test
>> srp/002 in combination with the rdma_rxe driver are as follows:
>> * Kernel v5.12: test passes.
>> * Kernel v5.13: test fails.
>> * Kernel v5.14-rc7: test fails.
>>
>> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
>> log:
>>
>> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
>> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>>
>> There is sufficient memory available in the VM in which I ran the tests. It is
>> not clear to me why ib_alloc_mr() fails with these parameters when using the
>> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
>> respects the max_pages_per_mr RDMA driver limit.
>
> A correction: test srp/002 passes on my setup against kernel v5.13. I probably
> selected the wrong kernel from the GRUB boot menu before I sent my previous email.
> So the test failure is something that happens with v5.14-rc but not with v5.13.
>
> Applying the following patch on top Linus' master branch did not help:
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index 742e6ec93686..643b80e47c82 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -88,7 +88,7 @@ enum rxe_device_param {
> RXE_MIN_SRQ_INDEX = 0x00020001,
> RXE_MAX_SRQ_INDEX = 0x00040000,
>
> - RXE_MAX_MR = 0x00001000,
> + RXE_MAX_MR = 0x00100000,
> RXE_MAX_MW = 0x00001000,
> RXE_MIN_MR_INDEX = 0x00000001,
> RXE_MAX_MR_INDEX = 0x00010000,
>
> Bart.
Bart,
Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a 5.13 thing.
I am still not seeing that error in my test setup. I am getting a soft lockup error after ~20 seconds.
During most of that there is a constant exchange of req/ack packets with nothing else happening.
If you want I can send you a patch to print out error messages from MR allocation.
Bob
* Re: v5.14 RXE driver broken?
2021-08-25 21:09 ` Bob Pearson
@ 2021-08-25 21:44 ` Bart Van Assche
0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 21:44 UTC (permalink / raw)
To: Bob Pearson, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block
On 8/25/21 2:09 PM, Bob Pearson wrote:
> Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a
> 5.13 thing. I am still not seeing that error in my test setup. I am getting
> a soft lockup error after ~20 seconds. During most of that there is a
> constant exchange of req/ack packets with nothing else happening.
>
> If you want I can send you a patch to print out error messages from MR
> allocation.
Hi Bob,
I see the ib_alloc_mr() failures with kernel v5.14 in two different VMs. A
different Linux distro has been installed in each VM.
If it would help your debugging efforts, please send me the patch that prints
out the MR allocation error messages.
Bart.
* Re: v5.14 RXE driver broken?
2021-08-25 16:32 ` Jason Gunthorpe
2021-08-25 18:03 ` Bob Pearson
2021-08-25 18:22 ` Bart Van Assche
@ 2021-08-26 19:03 ` Bob Pearson
2021-08-26 20:03 ` Bob Pearson
2021-08-27 3:18 ` Zhu Yanjun
2 siblings, 2 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 19:03 UTC (permalink / raw)
To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block
On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>> runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>> runtime 48.849s ... 15.024s
>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>> @@ -1,2 +1 @@
>>> Configured SRP target driver
>>> -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
>
> Jason
>
Jason, Bart, Zhu
I have succeeded in getting blktests to pass on 5.14. There is a bug in rxe that I had to fix. In
loopback mode, when an RNR NAK is received, the driver asks the requester to start a retry sequence
before the RNR timer fires, so the command is retried immediately regardless of the
value of the timeout. I made a small change which requires the requester to wait for either the
timer to fire or an ack to arrive. The srp/002 test case in blktests sometimes takes a long time to post
a receive, which caused a soft lockup. There is a second, non-bug issue: the number of
MRs was too small to run the test. I increased the MR limits by a factor of 256, which fixed that.
My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
I will submit a patch for the RNR fix.
Bob
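The effect described in the message above can be illustrated with a toy model. This is not rxe code: the function, the abstract time units and the simplification that the responder becomes ready at one fixed time are all assumptions made for illustration. It counts how often a requester resends a message when it retries immediately after each RNR NAK, compared with waiting for the RNR timer first.

```python
def count_sends(ready_at, rnr_timeout, round_trip=1):
    """Count transmissions until the responder has a receive posted.

    ready_at    -- time at which the responder posts its receive buffer
    rnr_timeout -- how long the requester waits after an RNR NAK
                   (0 models the immediate retry described above)
    round_trip  -- time for one send and its ACK or NAK to come back
    """
    now = 0
    sends = 0
    while True:
        sends += 1
        if now >= ready_at:
            # A receive is posted, so the responder ACKs and we are done.
            return sends
        # Otherwise an RNR NAK comes back and the requester waits
        # rnr_timeout before trying again.
        now += round_trip + rnr_timeout


immediate = count_sends(ready_at=1000, rnr_timeout=0)
with_timer = count_sends(ready_at=1000, rnr_timeout=99)
print(immediate, with_timer)
```

In this model the immediate-retry requester resends once per round trip for as long as the responder is not ready, which is the kind of tight loop that could plausibly show up as a soft lockup when the receive is posted late.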
* Re: v5.14 RXE driver broken?
2021-08-26 19:03 ` Bob Pearson
@ 2021-08-26 20:03 ` Bob Pearson
2021-08-27 3:18 ` Zhu Yanjun
1 sibling, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 20:03 UTC (permalink / raw)
To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block
On 8/26/21 2:03 PM, Bob Pearson wrote:
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>> runtime ... 48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>> runtime 48.849s ... 15.024s
>>>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>> @@ -1,2 +1 @@
>>>> Configured SRP target driver
>>>> -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
>>
>> Jason
>>
>
> Jason, Bart, Zhu
>
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix. In
> loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting
> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
>
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
>
> I will submit a patch for the rnr fix.
>
> Bob
>
Well it's better but not quite done yet.
* Re: v5.14 RXE driver broken?
2021-08-26 19:03 ` Bob Pearson
2021-08-26 20:03 ` Bob Pearson
@ 2021-08-27 3:18 ` Zhu Yanjun
1 sibling, 0 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-27 3:18 UTC (permalink / raw)
To: Bob Pearson; +Cc: Jason Gunthorpe, Bart Van Assche, linux-rdma, linux-block
On Fri, Aug 27, 2021 at 3:03 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> >> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>
> >>> Hi Bob,
> >>>
> >>> If I run the following test against Linus' master branch then that test
> >>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> >>> headers to staging"")):
> >>>
> >>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> >>> runtime ... 48.849s
> >>>
> >>> The following test fails:
> >>>
> >>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> >>> runtime 48.849s ... 15.024s
> >>> +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
> >>> @@ -1,2 +1 @@
> >>> Configured SRP target driver
> >>> -Passed
> >>
> >> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> >> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> >> fix this problem?
> >>
> >> And the commit will be merged into linux upstream very soon.
> >
> > Please let me know Bart, if the rxe driver is still broken I will
> > definitely punt all the changes for RXE to the next cycle until it can
> > be fixed.
> >
> > Jason
> >
>
> Jason, Bart, Zhu
>
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix. In
> loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting
Can this problem be reproduced with 5.13? According to Bart, this problem
does not occur with v5.13.
Thanks
Zhu Yanjun
> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
>
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
>
> I will submit a patch for the rnr fix.
>
> Bob
>