* v5.14 RXE driver broken?
@ 2021-08-24 3:01 Bart Van Assche
  2021-08-25 3:02 ` Zhu Yanjun
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-24 3:01 UTC
  To: Bob Pearson; +Cc: linux-rdma, linux-block

Hi Bob,

If I run the following test against Linus' master branch then that test
passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
headers to staging"")):

# export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
    runtime ... 48.849s

The following test fails:

# export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
    runtime 48.849s ... 15.024s
    --- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
    +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
    @@ -1,2 +1 @@
     Configured SRP target driver
    -Passed

The only difference between these two tests is that test (1) uses the siw
(soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
Both tests run reliably against previous Linux kernel versions, e.g.
v5.13. Can you take a look at this? The blktests software is available at
https://github.com/osandov/blktests/.

Thanks,

Bart.
* Re: v5.14 RXE driver broken?
  2021-08-24 3:01 v5.14 RXE driver broken? Bart Van Assche
@ 2021-08-25 3:02 ` Zhu Yanjun
  2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 16:46   ` Bart Van Assche
  0 siblings, 2 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-25 3:02 UTC
  To: Bart Van Assche; +Cc: Bob Pearson, linux-rdma, linux-block

On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>
> Hi Bob,
>
> If I run the following test against Linus' master branch then that test
> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> headers to staging"")):
>
> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>     runtime ... 48.849s
>
> The following test fails:
>
> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>     runtime 48.849s ... 15.024s
>     --- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>     @@ -1,2 +1 @@
>      Configured SRP target driver
>     -Passed

Can the commit "RDMA/rxe: Zero out index member of struct rxe_queue"
at the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
fix this problem?

That commit will be merged into Linux upstream very soon.

Zhu Yanjun

>
> The only difference between these two tests is that test (1) use the siw
> (soft-iWARP) driver while test (2) uses the rdma_rxe driver (soft-RoCE).
> Both tests run reliably against previous Linux kernel versions, e.g.
> v5.13. Can you take a look at this? The blktests software is available at
> https://github.com/osandov/blktests/.
>
> Thanks,
>
> Bart.
* Re: v5.14 RXE driver broken?
  2021-08-25 3:02 ` Zhu Yanjun
@ 2021-08-25 16:32   ` Jason Gunthorpe
  2021-08-25 18:03     ` Bob Pearson
                       ` (2 more replies)
  2021-08-25 16:46   ` Bart Van Assche
  1 sibling, 3 replies; 12+ messages in thread
From: Jason Gunthorpe @ 2021-08-25 16:32 UTC
  To: Zhu Yanjun; +Cc: Bart Van Assche, Bob Pearson, linux-rdma, linux-block

On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >
> > Hi Bob,
> >
> > If I run the following test against Linus' master branch then that test
> > passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> > headers to staging"")):
> >
> > # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> >     runtime ... 48.849s
> >
> > The following test fails:
> >
> > # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> > srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> >     runtime 48.849s ... 15.024s
> >     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
> >     @@ -1,2 +1 @@
> >      Configured SRP target driver
> >     -Passed
>
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
>
> And the commit will be merged into linux upstream very soon.

Please let me know, Bart. If the rxe driver is still broken, I will
definitely punt all the changes for RXE to the next cycle until it can
be fixed.

Jason
* Re: v5.14 RXE driver broken?
  2021-08-25 16:32 ` Jason Gunthorpe
@ 2021-08-25 18:03   ` Bob Pearson
  2021-08-25 18:22   ` Bart Van Assche
  2021-08-26 19:03   ` Bob Pearson
  2 siblings, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 18:03 UTC
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>     runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>     runtime 48.849s ... 15.024s
>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>     @@ -1,2 +1 @@
>>>      Configured SRP target driver
>>>     -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
>
> Jason
>

Jason,

I am (I think) able to reproduce Bart's issue. I wouldn't hold up the
'bug fix' patches; they are all legitimate.

Bob
* Re: v5.14 RXE driver broken?
  2021-08-25 16:32 ` Jason Gunthorpe
  2021-08-25 18:03   ` Bob Pearson
@ 2021-08-25 18:22   ` Bart Van Assche
  2021-08-25 20:58     ` Bart Van Assche
  2021-08-26 19:03   ` Bob Pearson
  2 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 18:22 UTC
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block

On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>     runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>     runtime 48.849s ... 15.024s
>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>     @@ -1,2 +1 @@
>>>      Configured SRP target driver
>>>     -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.

Hi Jason,

Thanks for having offered to revert the RXE changes from this merge window.
Unfortunately that wouldn't be sufficient.
My test results so far for test srp/002 in combination with the
rdma_rxe driver are as follows:
* Kernel v5.12: test passes.
* Kernel v5.13: test fails.
* Kernel v5.14-rc7: test fails.

For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
log:

ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count

There is sufficient memory available in the VM in which I ran the tests. It is
not clear to me why ib_alloc_mr() fails with these parameters when using the
rdma_rxe driver. As one can see in srp_alloc_fr_pool() the SRP initiator driver
respects the max_pages_per_mr RDMA driver limit.

Thanks,

Bart.
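The values in the ib_srp log line above are consistent with simple integer arithmetic. The sketch below is only an illustration of that arithmetic, not the ib_srp source. It assumes 512-byte sectors, and it assumes that mr_per_cmd is one MR for always-on registration plus the number of MRs needed to cover max_sectors + 1 sectors. The function names are made up for this example.

```python
# Illustrative model only; the formulas are assumptions that happen to
# reproduce the values printed in the kernel log quoted above.

SECTOR_SIZE = 512  # bytes; the conventional Linux block-layer sector size


def max_sectors_per_mr(max_pages_per_mr: int, mr_page_size: int) -> int:
    """Number of 512-byte sectors that a single memory region can map."""
    return max_pages_per_mr * mr_page_size // SECTOR_SIZE


def mr_per_cmd(max_sectors: int, sectors_per_mr: int,
               register_always: bool = True) -> int:
    """Assumed model: ceil((max_sectors + 1) / sectors_per_mr), plus one MR
    when memory registration is always used."""
    covering = (max_sectors + 1 + sectors_per_mr - 1) // sectors_per_mr
    return int(register_always) + covering


if __name__ == "__main__":
    per_mr = max_sectors_per_mr(max_pages_per_mr=512, mr_page_size=4096)
    print("max_sectors_per_mr =", per_mr)
    print("mr_per_cmd =", mr_per_cmd(max_sectors=1024, sectors_per_mr=per_mr))
```

With the logged inputs this model yields max_sectors_per_mr = 4096 and mr_per_cmd = 2, matching the log. It says nothing about why ib_alloc_mr() itself fails.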
* Re: v5.14 RXE driver broken?
  2021-08-25 18:22 ` Bart Van Assche
@ 2021-08-25 20:58   ` Bart Van Assche
  2021-08-25 21:09     ` Bob Pearson
  0 siblings, 1 reply; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 20:58 UTC
  To: Jason Gunthorpe; +Cc: Bob Pearson, Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 11:22 AM, Bart Van Assche wrote:
> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>     runtime ... 48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>     runtime 48.849s ... 15.024s
>>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>>     @@ -1,2 +1 @@
>>>>      Configured SRP target driver
>>>>     -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
>
> Hi Jason,
>
> Thanks for having offered to revert the RXE changes from this merge window.
> Unfortunately that wouldn't be sufficient.
> My test results so far for test
> srp/002 in combination with the rdma_rxe driver are as follows:
> * Kernel v5.12: test passes.
> * Kernel v5.13: test fails.
> * Kernel v5.14-rc7: test fails.
>
> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
> log:
>
> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>
> There is sufficient memory available in the VM in which I ran the tests. It is
> not clear to me why ib_alloc_mr() fails with these parameters when using the
> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
> respects the max_pages_per_mr RDMA driver limit.

A correction: test srp/002 passes on my setup against kernel v5.13. I probably
selected the wrong kernel from the GRUB boot menu before I sent my previous email.
So the test failure is something that happens with v5.14-rc but not with v5.13.

Applying the following patch on top of Linus' master branch did not help:

diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
index 742e6ec93686..643b80e47c82 100644
--- a/drivers/infiniband/sw/rxe/rxe_param.h
+++ b/drivers/infiniband/sw/rxe/rxe_param.h
@@ -88,7 +88,7 @@ enum rxe_device_param {
 	RXE_MIN_SRQ_INDEX		= 0x00020001,
 	RXE_MAX_SRQ_INDEX		= 0x00040000,
 
-	RXE_MAX_MR			= 0x00001000,
+	RXE_MAX_MR			= 0x00100000,
 	RXE_MAX_MW			= 0x00001000,
 	RXE_MIN_MR_INDEX		= 0x00000001,
 	RXE_MAX_MR_INDEX		= 0x00010000,

Bart.
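For scale, the diff above raises RXE_MAX_MR from 0x1000 (4096) to 0x100000 (1048576), a factor of 256. The sketch below shows one rough way to reason about when such a cap could matter. It assumes, hypothetically, that an initiator allocates mr_per_cmd MRs for every queued command on every channel. The channel counts and queue depth are invented for illustration and do not come from the thread; only the two constants and mr_per_cmd = 2 are taken from the messages above.

```python
# Rough sizing sketch under stated assumptions; not rxe or ib_srp code.

RXE_MAX_MR_OLD = 0x00001000  # value on the "-" line of the diff
RXE_MAX_MR_NEW = 0x00100000  # value on the "+" line of the diff


def mrs_needed(channels: int, queue_depth: int, mr_per_cmd: int) -> int:
    """Assumed demand model: each queued command on each channel
    consumes mr_per_cmd memory regions."""
    return channels * queue_depth * mr_per_cmd


def fits(demand: int, limit: int) -> bool:
    """True if the demand stays within the device-wide MR limit."""
    return demand <= limit


if __name__ == "__main__":
    print("factor:", RXE_MAX_MR_NEW // RXE_MAX_MR_OLD)
    # Hypothetical workloads; mr_per_cmd = 2 is the value from the kernel log.
    for channels in (16, 64):
        demand = mrs_needed(channels, queue_depth=64, mr_per_cmd=2)
        print(channels, demand,
              fits(demand, RXE_MAX_MR_OLD), fits(demand, RXE_MAX_MR_NEW))
```

Since the patch did not make the test pass, a cap of this kind cannot be the whole explanation for the failure reported here.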
* Re: v5.14 RXE driver broken?
  2021-08-25 20:58 ` Bart Van Assche
@ 2021-08-25 21:09   ` Bob Pearson
  2021-08-25 21:44     ` Bart Van Assche
  0 siblings, 1 reply; 12+ messages in thread
From: Bob Pearson @ 2021-08-25 21:09 UTC
  To: Bart Van Assche, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 3:58 PM, Bart Van Assche wrote:
> On 8/25/21 11:22 AM, Bart Van Assche wrote:
>> On 8/25/21 9:32 AM, Jason Gunthorpe wrote:
>>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>>
>>>>> Hi Bob,
>>>>>
>>>>> If I run the following test against Linus' master branch then that test
>>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>>> headers to staging"")):
>>>>>
>>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>>     runtime ... 48.849s
>>>>>
>>>>> The following test fails:
>>>>>
>>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>>     runtime 48.849s ... 15.024s
>>>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>>>     @@ -1,2 +1 @@
>>>>>      Configured SRP target driver
>>>>>     -Passed
>>>>
>>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>>> fix this problem?
>>>>
>>>> And the commit will be merged into linux upstream very soon.
>>>
>>> Please let me know Bart, if the rxe driver is still broken I will
>>> definitely punt all the changes for RXE to the next cycle until it can
>>> be fixed.
>>
>> Hi Jason,
>>
>> Thanks for having offered to revert the RXE changes from this merge window.
>> Unfortunately that wouldn't be sufficient. My test results so far for test
>> srp/002 in combination with the rdma_rxe driver are as follows:
>> * Kernel v5.12: test passes.
>> * Kernel v5.13: test fails.
>> * Kernel v5.14-rc7: test fails.
>>
>> For the rdma_rxe tests for kernel v5.14-rc7 I found the following in the kernel
>> log:
>>
>> ib_srp:add_target_store: ib_srp: max_sectors = 1024; max_pages_per_mr = 512; mr_page_size = 4096; max_sectors_per_mr = 4096; mr_per_cmd = 2
>> ib_srp: enp1s0_rxe: ib_alloc_mr() failed. Try to reduce max_cmd_per_lun, max_sect or ch_count
>>
>> There is sufficient memory available in the VM in which I ran the tests. It is
>> not clear to me why ib_alloc_mr() fails with these parameters when using the
>> rdma_rxe driver? As one can see in srp_alloc_fr_pool() the SRP initiator driver
>> respects the max_pages_per_mr RDMA driver limit.
>
> A correction: test srp/002 passes on my setup against kernel v5.13. I probably
> selected the wrong kernel from the GRUB boot menu before I sent my previous email.
> So the test failure is something that happens with v5.14-rc but not with v5.13.
>
> Applying the following patch on top Linus' master branch did not help:
>
> diff --git a/drivers/infiniband/sw/rxe/rxe_param.h b/drivers/infiniband/sw/rxe/rxe_param.h
> index 742e6ec93686..643b80e47c82 100644
> --- a/drivers/infiniband/sw/rxe/rxe_param.h
> +++ b/drivers/infiniband/sw/rxe/rxe_param.h
> @@ -88,7 +88,7 @@ enum rxe_device_param {
>  	RXE_MIN_SRQ_INDEX		= 0x00020001,
>  	RXE_MAX_SRQ_INDEX		= 0x00040000,
>
> -	RXE_MAX_MR			= 0x00001000,
> +	RXE_MAX_MR			= 0x00100000,
>  	RXE_MAX_MW			= 0x00001000,
>  	RXE_MIN_MR_INDEX		= 0x00000001,
>  	RXE_MAX_MR_INDEX		= 0x00010000,
>
> Bart.

Bart,

Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a
5.13 thing. I am still not seeing that error in my test setup. I am getting
a soft lockup error after ~20 seconds.
During most of that time there is a constant exchange of req/ack packets
with nothing else happening.

If you want I can send you a patch to print out error messages from MR
allocation.

Bob
* Re: v5.14 RXE driver broken?
  2021-08-25 21:09 ` Bob Pearson
@ 2021-08-25 21:44   ` Bart Van Assche
  0 siblings, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 21:44 UTC
  To: Bob Pearson, Jason Gunthorpe; +Cc: Zhu Yanjun, linux-rdma, linux-block

On 8/25/21 2:09 PM, Bob Pearson wrote:
> Are you seeing the ib_alloc_mr() failure in 5.14? I thought that was just a
> 5.13 thing. I am still not seeing that error in my test setup. I am getting
> a soft lockup error after ~20 seconds. During most of that there is a
> constant exchange of req/ack packets with nothing else happening.
>
> If you want I can send you a patch to print out error messages from MR
> allocation.

Hi Bob,

I see the ib_alloc_mr() failures with kernel v5.14 in two different VMs. A
different Linux distro has been installed in each VM. If it would help your
debugging efforts, please send me the patch that prints out the MR allocation
error messages.

Bart.
* Re: v5.14 RXE driver broken?
  2021-08-25 16:32 ` Jason Gunthorpe
  2021-08-25 18:03   ` Bob Pearson
  2021-08-25 18:22   ` Bart Van Assche
@ 2021-08-26 19:03   ` Bob Pearson
  2021-08-26 20:03     ` Bob Pearson
  2021-08-27 3:18     ` Zhu Yanjun
  2 siblings, 2 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 19:03 UTC
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>
>>> Hi Bob,
>>>
>>> If I run the following test against Linus' master branch then that test
>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>> headers to staging"")):
>>>
>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>     runtime ... 48.849s
>>>
>>> The following test fails:
>>>
>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>     runtime 48.849s ... 15.024s
>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>     @@ -1,2 +1 @@
>>>      Configured SRP target driver
>>>     -Passed
>>
>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>> fix this problem?
>>
>> And the commit will be merged into linux upstream very soon.
>
> Please let me know Bart, if the rxe driver is still broken I will
> definitely punt all the changes for RXE to the next cycle until it can
> be fixed.
>
> Jason
>

Jason, Bart, Zhu

I have succeeded in getting blktests to pass on 5.14. There is a bug in rxe that I had to fix.
In loopback mode, when an RNR NAK is received, it requests the requester to
start a retry sequence before the RNR timer fires, which results in the
command being retried immediately regardless of the value of the timeout.
I made a small change which requires the requester to wait for either the
timer to fire or an ack to arrive. The srp/002 test case in blktests
sometimes spends a long time before posting a receive, which caused a soft
lockup. There is a second issue, not a bug: the number of MRs was too small
to run the test. I increased it by a factor of 256, which fixed that.

My test setup has for-next + 5 recent rxe fix patches applied in addition to
the RNR timing one above.

I will submit a patch for the RNR fix.

Bob
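The difference between retrying immediately and waiting for the RNR timer can be illustrated with a toy model. The sketch below is not the rxe code or the actual patch. It assumes abstract time ticks, assumes that one send plus RNR NAK round trip costs one tick in loopback, and ignores the "ack arrives early" path mentioned above. The function name and all numbers are hypothetical.

```python
# Toy model of RNR NAK handling; every name and number here is illustrative.

def count_sends(receive_posted_at: int, rnr_timeout: int,
                wait_for_timer: bool, max_sends: int = 10_000) -> int:
    """Count how often a request is sent before the responder accepts it.

    receive_posted_at: tick at which the responder finally posts a receive.
    rnr_timeout:       ticks the requester should wait after an RNR NAK.
    wait_for_timer:    False models the immediate-retry behaviour,
                       True models waiting for the RNR timer to fire.
    """
    now = 0
    sends = 0
    while sends < max_sends:
        sends += 1
        if now >= receive_posted_at:
            return sends  # a receive buffer is available, request is ACKed
        # Responder is not ready and answers with an RNR NAK.
        now += rnr_timeout if wait_for_timer else 1
    return sends  # gave up after max_sends attempts


if __name__ == "__main__":
    print("immediate retry:", count_sends(1000, 100, wait_for_timer=False))
    print("wait for timer: ", count_sends(1000, 100, wait_for_timer=True))
```

With these made-up numbers the immediate-retry variant sends the request 1001 times and the timer variant 11 times, which is the kind of constant req/ack exchange that keeps a CPU busy while the responder has not yet posted a receive.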
* Re: v5.14 RXE driver broken?
  2021-08-26 19:03 ` Bob Pearson
@ 2021-08-26 20:03   ` Bob Pearson
  2021-08-27 3:18   ` Zhu Yanjun
  1 sibling, 0 replies; 12+ messages in thread
From: Bob Pearson @ 2021-08-26 20:03 UTC
  To: Jason Gunthorpe, Zhu Yanjun; +Cc: Bart Van Assche, linux-rdma, linux-block

On 8/26/21 2:03 PM, Bob Pearson wrote:
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
>> On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
>>> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>>>
>>>> Hi Bob,
>>>>
>>>> If I run the following test against Linus' master branch then that test
>>>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>>>> headers to staging"")):
>>>>
>>>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>>>     runtime ... 48.849s
>>>>
>>>> The following test fails:
>>>>
>>>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>>>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>>>     runtime 48.849s ... 15.024s
>>>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>>>     @@ -1,2 +1 @@
>>>>      Configured SRP target driver
>>>>     -Passed
>>>
>>> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
>>> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
>>> fix this problem?
>>>
>>> And the commit will be merged into linux upstream very soon.
>>
>> Please let me know Bart, if the rxe driver is still broken I will
>> definitely punt all the changes for RXE to the next cycle until it can
>> be fixed.
>>
>> Jason
>>
>
> Jason, Bart, Zhu
>
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix.
> In loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting
> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
>
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
>
> I will submit a patch for the rnr fix.
>
> Bob
>

Well, it's better but not quite done yet.
* Re: v5.14 RXE driver broken?
  2021-08-26 19:03 ` Bob Pearson
  2021-08-26 20:03   ` Bob Pearson
@ 2021-08-27 3:18   ` Zhu Yanjun
  1 sibling, 0 replies; 12+ messages in thread
From: Zhu Yanjun @ 2021-08-27 3:18 UTC
  To: Bob Pearson; +Cc: Jason Gunthorpe, Bart Van Assche, linux-rdma, linux-block

On Fri, Aug 27, 2021 at 3:03 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 8/25/21 11:32 AM, Jason Gunthorpe wrote:
> > On Wed, Aug 25, 2021 at 11:02:14AM +0800, Zhu Yanjun wrote:
> >> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
> >>>
> >>> Hi Bob,
> >>>
> >>> If I run the following test against Linus' master branch then that test
> >>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
> >>> headers to staging"")):
> >>>
> >>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
> >>>     runtime ... 48.849s
> >>>
> >>> The following test fails:
> >>>
> >>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
> >>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
> >>>     runtime 48.849s ... 15.024s
> >>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
> >>>     @@ -1,2 +1 @@
> >>>      Configured SRP target driver
> >>>     -Passed
> >>
> >> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> >> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> >> fix this problem?
> >>
> >> And the commit will be merged into linux upstream very soon.
> >
> > Please let me know Bart, if the rxe driver is still broken I will
> > definitely punt all the changes for RXE to the next cycle until it can
> > be fixed.
> >
> > Jason
> >
>
> Jason, Bart, Zhu
>
> I have succeeded in getting blktest to pass on 5.14. There is a bug in rxe that I had to fix.
> In loopback mode when an RNR NAK is received it requests the requester to start a retry sequence
> before the rnr timer fires which results in the command being retried immediately regardless of the
> value of the timeout. I made a small change which requires the requester to wait for either the
> timer to fire or an ack to arrive. The srp/002 test case in blktest spends a long time before posting

Can this problem be reproduced with 5.13? According to Bart, this problem
does not occur with v5.13.

Thanks,
Zhu Yanjun

> a receive in some cases which caused a soft lockup. There is a second non-bug which is the number of
> MRs was too small to run the test. I increased these by a factor of 256 which fixed that.
>
> My test setup has for-next + 5 recent rxe fix patches applied in addition to the RNR timing one above.
>
> I will submit a patch for the rnr fix.
>
> Bob
>
* Re: v5.14 RXE driver broken?
  2021-08-25 3:02 ` Zhu Yanjun
  2021-08-25 16:32   ` Jason Gunthorpe
@ 2021-08-25 16:46   ` Bart Van Assche
  1 sibling, 0 replies; 12+ messages in thread
From: Bart Van Assche @ 2021-08-25 16:46 UTC
  To: Zhu Yanjun; +Cc: Bob Pearson, linux-rdma, linux-block

On 8/24/21 8:02 PM, Zhu Yanjun wrote:
> On Tue, Aug 24, 2021 at 11:02 AM Bart Van Assche <bvanassche@acm.org> wrote:
>>
>> Hi Bob,
>>
>> If I run the following test against Linus' master branch then that test
>> passes (commit d5ae8d7f85b7 ("Revert "media: dvb header files: move some
>> headers to staging"")):
>>
>> # export use_siw=1 && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [passed]
>>     runtime ... 48.849s
>>
>> The following test fails:
>>
>> # export use_siw= && modprobe brd && (cd blktests && ./check -q srp/002)
>> srp/002 (File I/O on top of multipath concurrently with logout and login (mq)) [failed]
>>     runtime 48.849s ... 15.024s
>>     --- tests/srp/002.out 2018-09-08 19:43:42.291664821 -0700
>>     +++ /home/bart/software/blktests/results/nodev/srp/002.out.bad 2021-08-23 19:51:05.182958728 -0700
>>     @@ -1,2 +1 @@
>>      Configured SRP target driver
>>     -Passed
>
> Can this commit "RDMA/rxe: Zero out index member of struct rxe_queue"
> in the link https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?h=wip/jgg-for-rc
> fix this problem?
>
> And the commit will be merged into linux upstream very soon.

Hi Zhu,

Thanks for having taken a look. Isn't commit cc4f596cf85e ("RDMA/rxe: Zero
out index member of struct rxe_queue") already in Linus' tree? I think it
was merged yesterday (August 24). Unfortunately the test I mentioned still
fails on top of that patch.

Thanks,

Bart.