* RXE status in the upstream rping using rxe
From: Leon Romanovsky @ 2021-08-03 18:07 UTC
To: Olga Kornievskaia, Bob Pearson, Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma

Hi,

Can you please help me to understand the RXE status in the upstream?

Do we still have crashes, interop issues, etc.?

Latest commit is:
20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")

Thanks

* Re: RXE status in the upstream rping using rxe
From: Zhu Yanjun @ 2021-08-04 1:01 UTC
To: Leon Romanovsky
Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> Hi,
>
> Can you please help me to understand the RXE status in the upstream?
>
> Do we still have crashes, interop issues, etc.?

I did some development with RXE upstream. From my usage of the latest RXE,
I found the following:

1. rdma-core does not work well with the latest RDMA git;
2. There are some problems with rxe across different kernel versions;

Regarding 2, the commit
https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/
can relieve this problem, but I am not sure whether some bugs still exist in RXE.

Regarding 1, I will continue to investigate.
So we should take time to run tests, fix bugs and make rxe stable.

Zhu Yanjun

>
> Latest commit is:
> 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")
>
> Thanks

* Re: RXE status in the upstream rping using rxe
From: Zhu Yanjun @ 2021-08-04 1:09 UTC
To: Leon Romanovsky
Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > Hi,
> >
> > Can you please help me to understand the RXE status in the upstream?
> >
> > Do we still have crashes, interop issues, etc.?
>
> I did some development with RXE upstream. From my usage of the latest RXE,
> I found the following:
>
> 1. rdma-core does not work well with the latest RDMA git;

The latest RDMA git is
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

Zhu Yanjun

> 2. There are some problems with rxe across different kernel versions;
>
> Regarding 2, the commit
> https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/
> can relieve this problem, but I am not sure whether some bugs still exist in RXE.
>
> Regarding 1, I will continue to investigate.
> So we should take time to run tests, fix bugs and make rxe stable.
>
> Zhu Yanjun
> >
> > Latest commit is:
> > 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")
> >
> > Thanks

* Re: RXE status in the upstream rping using rxe
From: Leon Romanovsky @ 2021-08-04 5:41 UTC
To: Zhu Yanjun; +Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > Hi,
> > >
> > > Can you please help me to understand the RXE status in the upstream?
> > >
> > > Do we still have crashes, interop issues, etc.?
> >
> > I did some development with RXE upstream. From my usage of the latest RXE,
> > I found the following:
> >
> > 1. rdma-core does not work well with the latest RDMA git;
>
> The latest RDMA git is
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

"Latest" is a relative term; what SHA did you test?
Let's focus on fixing RXE before we continue with new features.

Thanks

* Re: RXE status in the upstream rping using rxe
From: Zhu Yanjun @ 2021-08-04 9:05 UTC
To: Leon Romanovsky
Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > >
> > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Can you please help me to understand the RXE status in the upstream?
> > > >
> > > > Do we still have crashes, interop issues, etc.?
> > >
> > > I did some development with RXE upstream. From my usage of the latest RXE,
> > > I found the following:
> > >
> > > 1. rdma-core does not work well with the latest RDMA git;
> >
> > The latest RDMA git is
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>
> "Latest" is a relative term; what SHA did you test?
> Let's focus on fixing RXE before we continue with new features.

Thanks a lot. I agree with you.

rdma-core:
313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #1038 from selvintxavier/master
2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
327d45e0 tests: Add missing MAC element to args list
66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
8754fb51 bnxt_re/lib: Use separate indices for shadow queue
be4d8abf bnxt_re/lib: add a function to initialize software queue

kernel rdma:
0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD) RDMA/qedr: Improve error logs for rdma_alloc_tid error return
090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
991c4274dc17 RDMA/hfi1: Fix typo in comments
8d7e415d5561 docs: Fix infiniband uverbs minor number
bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure that requests are valid
bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on hfi1_devdata->user_refcount

With the above kernel and rdma-core, the following messages appear:
"
[   54.214608] rdma_rxe: loaded
[   54.217089] infiniband rxe0: set active
[   54.217101] infiniband rxe0: added enp0s8
[  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
[  169.138889] rdma_rxe: qp#30 moved to error state
[  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
[  169.160601] rdma_rxe: qp#31 moved to error state
[  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
[  169.182170] rdma_rxe: qp#32 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
[  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
[  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
[  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
[  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
[  200.332086] rdma_rxe: qp#58 moved to error state
[  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
[  200.396514] rdma_rxe: qp#61 moved to error state
[  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
[  200.417956] rdma_rxe: qp#62 moved to error state
[  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
[  200.439654] rdma_rxe: qp#63 moved to error state
[  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
[  200.933153] rdma_rxe: qp#70 moved to error state
[  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
[  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
[  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
[  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
[  208.360028] rdma_rxe: qp#89 moved to error state
[  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
[  208.425675] rdma_rxe: qp#92 moved to error state
[  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
[  208.447370] rdma_rxe: qp#93 moved to error state
[  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
[  208.469550] rdma_rxe: qp#94 moved to error state
[  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
[  208.956731] rdma_rxe: qp#100 moved to error state
[  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
[  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
[  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
[  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
[  218.363808] rdma_rxe: qp#119 moved to error state
[  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
[  218.429513] rdma_rxe: qp#122 moved to error state
[  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
[  218.451481] rdma_rxe: qp#123 moved to error state
[  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
[  218.473908] rdma_rxe: qp#124 moved to error state
[  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
[  218.963641] rdma_rxe: qp#130 moved to error state
[  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
[  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
[  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
[  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
[  235.305319] rdma_rxe: qp#149 moved to error state
[  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
[  235.368838] rdma_rxe: qp#152 moved to error state
[  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
[  235.390192] rdma_rxe: qp#153 moved to error state
[  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
[  235.411374] rdma_rxe: qp#154 moved to error state
[  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
[  235.895828] rdma_rxe: qp#161 moved to error state
"
I am not sure whether these indicate problems.
IMO, we should investigate further.

Thanks
Zhu Yanjun

>
> Thanks
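
The rdma_rxe messages in the 2021-08-04 9:05 log come in five bursts with the same shape (two "cqe > max_cqe", one "cqe < current # elements", four "no MW matches", one "no MR matches", and five "moved to error state" lines each). A minimal Python sketch for tallying such messages from saved dmesg output follows; the category names and the `tally` helper are hypothetical and only illustrate the idea, and the sample is the first burst with some lines omitted.

```python
import re
from collections import Counter

# Hypothetical category names; the regexes match the rdma_rxe messages
# quoted in the log of the 2021-08-04 9:05 message.
PATTERNS = {
    "cqe_above_max": re.compile(r"rdma_rxe: cqe\(\d+\) > max_cqe\(\d+\)"),
    "cqe_below_queued": re.compile(r"rdma_rxe: cqe\(\d+\) < current # elements in queue"),
    "no_mw_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MW matches rkey"),
    "no_mr_for_rkey": re.compile(r"rdma_rxe: check_rkey: no MR matches rkey"),
    "qp_to_error": re.compile(r"rdma_rxe: qp#\d+ moved to error state"),
}


def tally(log_text):
    """Count log lines per category; lines matching no pattern are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts


# A shortened excerpt of the first burst, copied from the log.
SAMPLE = """\
[  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
"""

if __name__ == "__main__":
    for name, count in sorted(tally(SAMPLE).items()):
        print(f"{name}: {count}")
```

Feeding the full log to `tally` instead of the excerpt gives per-category totals, which makes it easier to compare runs across kernel versions.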

* Re: RXE status in the upstream rping using rxe
From: Olga Kornievskaia @ 2021-08-06 2:37 UTC
To: Zhu Yanjun; +Cc: Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Can you please help me to understand the RXE status in the upstream?
> > > > >
> > > > > Do we still have crashes, interop issues, etc.?
> > > >
> > > > I did some development with RXE upstream. From my usage of the latest RXE,
> > > > I found the following:
> > > >
> > > > 1. rdma-core does not work well with the latest RDMA git;
> > >
> > > The latest RDMA git is
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >
> > "Latest" is a relative term; what SHA did you test?
> > Let's focus on fixing RXE before we continue with new features.
>
> Thanks a lot. I agree with you.

I believe simple rping still doesn't work linux-to-linux. The last
working version (of rping in rxe) was 5.13, I think. I have posted a
number of crashes that rping encounters (gotta get that working before I
can even try NFSoRDMA).

Thank you for working on the code.

We (NFS community) do test NFSoRDMA on every git pull using rxe and siw,
but lately we have been encountering problems.

* Re: RXE status in the upstream rping using rxe
From: Zhu Yanjun @ 2021-08-17 2:28 UTC
To: Olga Kornievskaia
Cc: Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > > > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can you please help me to understand the RXE status in the upstream?
> > > > > >
> > > > > > Do we still have crashes, interop issues, etc.?
> > > > >
> > > > > I did some development with RXE upstream. From my usage of the latest RXE,
> > > > > I found the following:
> > > > >
> > > > > 1. rdma-core does not work well with the latest RDMA git;
> > > >
> > > > The latest RDMA git is
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> > >
> > > "Latest" is a relative term; what SHA did you test?
> > > Let's focus on fixing RXE before we continue with new features.
> >
> > Thanks a lot. I agree with you.
>
> I believe simple rping still doesn't work linux-to-linux. The last
> working version (of rping in rxe) was 5.13, I think. I have posted a
> number of crashes that rping encounters (gotta get that working before I
> can even try NFSoRDMA).

The following are my tests.

1. modprobe rdma_rxe
2. modprobe -v -r rdma_rxe
3. rdma link add rxe
4. rdma link del rxe
5. Latest rdma-core && latest kernel upstream;
6. Latest kernel <------rping------> 5.10.y stable
7. Latest kernel <------rping------> 5.11.y stable
8. Latest kernel <------rping------> 5.12.y stable
9. Latest kernel <------rping------> 5.13.y stable

It seems that the latest upstream kernel (5.14-rc6) can rping the other
stable kernels.
Can you run your tests again?

Zhu Yanjun

>
> Thank you for working on the code.
>
> We (NFS community) do test NFSoRDMA on every git pull using rxe and siw,
> but lately we have been encountering problems.
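
The numbered tests in the 2021-08-17 message are given in shorthand. One way to spell them out as a repeatable client-side sequence is sketched below in Python. Everything beyond the shorthand is an assumption: the full `rdma link add NAME type rxe netdev DEV` form, the `rxe0` link name, the rping options (`-c` client, `-a` address, `-v` verbose, `-C` iteration count), the ordering of the steps, and the example peer address. The `build_steps` helper is hypothetical and only builds the argument lists; it does not execute anything.

```python
import shlex


def build_steps(netdev, peer_addr, link_name="rxe0", count=10):
    """Return the smoke-test commands as argv lists (nothing is executed).

    The ordering is an assumption: load the module, create the rxe link,
    rping the peer, then tear everything down again.
    """
    return [
        ["modprobe", "rdma_rxe"],
        ["rdma", "link", "add", link_name, "type", "rxe", "netdev", netdev],
        ["rping", "-c", "-a", peer_addr, "-v", "-C", str(count)],
        ["rdma", "link", "delete", link_name],
        ["modprobe", "-v", "-r", "rdma_rxe"],
    ]


if __name__ == "__main__":
    # enp0s8 is the netdev named in the earlier log; 192.0.2.10 is a
    # documentation-only placeholder address for the peer.
    for argv in build_steps("enp0s8", "192.0.2.10"):
        print(shlex.join(argv))
```

The peer would need a matching server started first, for example `rping -s -a <its address> -v -C 10`, and the printed commands require root to run.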

* Re: RXE status in the upstream rping using rxe
From: yangx.jy @ 2021-08-18 6:43 UTC
To: Zhu Yanjun
Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On 2021/8/17 10:28, Zhu Yanjun wrote:
> On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>>>
>>>>>>> Do we still have crashes, interop issues, etc.?
>>>>>> I did some development with RXE upstream. From my usage of the latest RXE,
>>>>>> I found the following:
>>>>>>
>>>>>> 1. rdma-core does not work well with the latest RDMA git;
>>>>> The latest RDMA git is
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>>> "Latest" is a relative term; what SHA did you test?
>>>> Let's focus on fixing RXE before we continue with new features.
>>> Thanks a lot. I agree with you.
>> I believe simple rping still doesn't work linux-to-linux. The last
>> working version (of rping in rxe) was 5.13, I think. I have posted a
>> number of crashes that rping encounters (gotta get that working before I
>> can even try NFSoRDMA).
> The following are my tests.
>
> 1. modprobe rdma_rxe
> 2. modprobe -v -r rdma_rxe
> 3. rdma link add rxe
> 4. rdma link del rxe
> 5. Latest rdma-core && latest kernel upstream;
> 6. Latest kernel <------rping------> 5.10.y stable
> 7. Latest kernel <------rping------> 5.11.y stable
> 8. Latest kernel <------rping------> 5.12.y stable
> 9. Latest kernel <------rping------> 5.13.y stable
>
> It seems that the latest upstream kernel (5.14-rc6) can rping the other
> stable kernels.
> Can you run your tests again?
>
> Zhu Yanjun

Hi,

I still get two similar panics with rping or rdma_client/server between the latest kernel and 5.13:

Panic1:
--------------------------------------------------------
[  268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
[  268.251049] #PF: supervisor read access in kernel mode
[  268.252491] #PF: error_code(0x0000) - not-present page
[  268.253919] PGD 1000067 P4D 1000067 PUD 0
[  268.255052] Oops: 0000 [#1] SMP PTI
[  268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
[  268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
[  268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
[  268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
[  268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
[  268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
[  268.274080] FS: 0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
[  268.275899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
[  268.278825] Call Trace:
[  268.279358] <IRQ>
[  268.279747] rxe_responder+0x11b1/0x2490 [rdma_rxe]
[  268.280798] rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  268.281895] rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Panic2:
--------------------------------------------------------
[  212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
[  212.530688] #PF: supervisor read access in kernel mode
[  212.533030] #PF: error_code(0x0000) - not-present page
[  212.535428] PGD 1000067 P4D 1000067 PUD 0
[  212.536970] Oops: 0000 [#1] SMP PTI
[  212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
[  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
[  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
[  212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
[  212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
[  212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
[  212.555225] FS: 0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
[  212.556749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
[  212.559177] Call Trace:
[  212.559655] <IRQ>
[  212.560055] send_data_in+0x55/0x73 [rdma_rxe]
[  212.560903] rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
[  212.561865] rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  212.562699] rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Note: it is easy to reproduce the panic on the latest kernel.

Best Regards,
Xiao Yang

>> Thank you for working on the code.
>>> [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482 >>> [ 235.895828] rdma_rxe: qp#161 moved to error state >>> " >>> Not sure if they are problems. >>> IMO, we should make further investigations. >>> >>> Thanks >>> Zhu Yanjun >>>> Thanks > ^ permalink raw reply [flat|nested] 19+ messages in thread
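Both panics reported in this message fault inside copy_data() while reading an address four bytes past the value held in RDX (the marked bytes `<44> 8b 42 04` decode to `mov 0x4(%rdx),%r8d`). The following is a minimal sketch, not part of the original report, of how that relationship can be checked mechanically from an oops excerpt. The helper names `parse_registers` and `nearest_base` are hypothetical, and the 4096-byte window is an arbitrary assumption; the register values are copied from the Panic2 trace.

```python
import re

# Excerpt of the Panic2 register dump quoted in this thread.
OOPS_EXCERPT = """\
[  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
[  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
[  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
"""

# Matches "RAX: <16 hex digits>" style pairs, including control registers.
REG_RE = re.compile(r"\b(R[A-Z0-9]{1,2}|CR\d):\s+([0-9a-f]{16})\b")


def parse_registers(text):
    """Return {register name: integer value} for every pair found in text."""
    return {name: int(value, 16) for name, value in REG_RE.findall(text)}


def nearest_base(regs, max_offset=4096):
    """Find the general-purpose register closest below the faulting address.

    CR2 holds the faulting address on x86. Returns (offset, register name),
    or None when no register lies within max_offset bytes below CR2.
    """
    fault = regs["CR2"]
    candidates = []
    for name, value in regs.items():
        if name.startswith("CR"):
            continue  # control registers are not pointer candidates
        offset = fault - value
        if 0 <= offset < max_offset:
            candidates.append((offset, name))
    return min(candidates) if candidates else None


if __name__ == "__main__":
    registers = parse_registers(OOPS_EXCERPT)
    print(nearest_base(registers))
```

A result pointing at RDX with a small offset suggests the pointer argument held in RDX was already invalid when copy_data() dereferenced it; it does not by itself say where the bad pointer came from.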
* Re: RXE status in the upstream rping using rxe 2021-08-18 6:43 ` yangx.jy @ 2021-08-18 7:20 ` Zhu Yanjun 2021-08-18 7:44 ` yangx.jy 0 siblings, 1 reply; 19+ messages in thread From: Zhu Yanjun @ 2021-08-18 7:20 UTC (permalink / raw) To: yangx.jy Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma On Wed, Aug 18, 2021 at 2:43 PM yangx.jy@fujitsu.com <yangx.jy@fujitsu.com> wrote: > > 于 2021/8/17 10:28, Zhu Yanjun 写道: > > On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia<aglo@umich.edu> wrote: > >> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun<zyjzyj2000@gmail.com> wrote: > >>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky<leon@kernel.org> wrote: > >>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote: > >>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun<zyjzyj2000@gmail.com> wrote: > >>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky<leon@kernel.org> wrote: > >>>>>>> Hi, > >>>>>>> > >>>>>>> Can you please help me to understand the RXE status in the upstream? > >>>>>>> > >>>>>>> Does we still have crashes/interop issues/e.t.c? > >>>>>> I made some developments with the RXE in the upstream, from my usage > >>>>>> with latest RXE, > >>>>>> I found the following: > >>>>>> > >>>>>> 1. rdma-core can not work well with latest RDMA git; > >>>>> The latest RDMA git is > >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git > >>>> "Latest" is a relative term, what SHA did you test? > >>>> Let's focus on fixing RXE before we will continue with new features. > >>> Thanks a lot. I agree with you. > >> I believe simple rping still doesn't work linux-to-linux. The last > >> working version (of rping in rxe) was 5.13 I think. I have posted a > >> number of crashes rping encounters (gotta get that working before I > >> can even try NFSoRDMA). > > The following are my tests. > > > > 1. Modprobe rdma_rxe > > 2. Modprobe -v -r rdma_rxe > > 3. Rdma link add rxe > > 4. Rdma link del rxe > > 5. 
Latest rdma-core&& latest kernel upstream; > > 6. Latest kernel< ------rping----> 5.10.y stable > > 7. Latest kernel< ------rping----> 5.11.y stable > > 8. Latest kernel< ------rping----> 5.12.y stable > > 9. Latest kernel< ------rping----> 5.13.y stable > > > > It seems that the latest kernel upstream (5.14-rc6) can rping other > > stable kernels. > > Can you make tests again? > > > > Zhu Yanjun > Hi, > > I still get two simliar panic by rping or rdma_client/server on latest kernel vs 5.13: > Panic1: > -------------------------------------------------------- > [ 268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414 > [ 268.251049] #PF: supervisor read access in kernel mode > [ 268.252491] #PF: error_code(0x0000) - not-present page > [ 268.253919] PGD 1000067 P4D 1000067 PUD 0 > [ 268.255052] Oops: 0000 [#1] SMP PTI > [ 268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1 > [ 268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014 > [ 268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe] > [ 268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b > [ 268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202 > [ 268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76 > [ 268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400 > [ 268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000 > [ 268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000 > [ 268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010 > [ 268.274080] FS: 0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000 > [ 268.275899] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 
0000000000060ee0 > [ 268.278825] Call Trace: > [ 268.279358]<IRQ> > [ 268.279747] rxe_responder+0x11b1/0x2490 [rdma_rxe] > [ 268.280798] rxe_do_task+0x9c/0xe0 [rdma_rxe] > [ 268.281895] rxe_rcv+0x286/0x8e0 [rdma_rxe] > ... > ------------------------------------------------------ > > Panic2: > -------------------------------------------------------- > [ 212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14 > [ 212.530688] #PF: supervisor read access in kernel mode > [ 212.533030] #PF: error_code(0x0000) - not-present page > [ 212.535428] PGD 1000067 P4D 1000067 PUD 0 > [ 212.536970] Oops: 0000 [#1] SMP PTI > [ 212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1 > [ 212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014 > [ 212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe] > [ 212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b > [ 212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202 > [ 212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076 > [ 212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700 > [ 212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000 > [ 212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076 > [ 212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010 > [ 212.555225] FS: 0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000 > [ 212.556749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0 > [ 212.559177] Call Trace: > [ 212.559655]<IRQ> > [ 212.560055] send_data_in+0x55/0x73 [rdma_rxe] > [ 212.560903] rxe_responder.cold+0xea/0x1f8 [rdma_rxe] > [ 212.561865] rxe_do_task+0x9c/0xe0 [rdma_rxe] > [ 
212.562699] rxe_rcv+0x286/0x8e0 [rdma_rxe] > ... > ------------------------------------------------------ > > Note: it is easy to reproduce the panic on the lastest kernel. Can you let me know how to reproduce the panic? 1. linux upstream < ----rping---- > linux upstream? 2. just run rping? 3. how do you create rxe? with rdma link or rxe_cfg? 4. do you make other operations? 5. other operations? Thanks. Zhu Yanjun > > Best Regards, > Xiao Yang > > > > >> Thank you for working on the code. > >> > >> We (NFS community) do test NFSoRDMA every git pull using rxe and siw > >> but lately have been encountering problems. > >> > >>> rdma-core: > >>> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull > >>> request #1038 from selvintxavier/master > >>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr > >>> 327d45e0 tests: Add missing MAC element to args list > >>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices > >>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue > >>> be4d8abf bnxt_re/lib: add a function to initialize software queue > >>> > >>> kernel rdma: > >>> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD) > >>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return > >>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc > >>> 991c4274dc17 RDMA/hfi1: Fix typo in comments > >>> 8d7e415d5561 docs: Fix infiniband uverbs minor number > >>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure > >>> that requests are valid > >>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting > >>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails > >>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on > >>> hfi1_devdata->user_refcount > >>> > >>> with the above kernel and rdma-core, the following messages will appear. 
> >>> " > >>> [ 54.214608] rdma_rxe: loaded > >>> [ 54.217089] infiniband rxe0: set active > >>> [ 54.217101] infiniband rxe0: added enp0s8 > >>> [ 167.623200] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 167.645590] rdma_rxe: cqe(1)< current # elements in queue (6) > >>> [ 167.733297] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247 > >>> [ 169.074796] rdma_rxe: qp#27 moved to error state > >>> [ 169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de > >>> [ 169.138889] rdma_rxe: qp#30 moved to error state > >>> [ 169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7 > >>> [ 169.160601] rdma_rxe: qp#31 moved to error state > >>> [ 169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782 > >>> [ 169.182170] rdma_rxe: qp#32 moved to error state > >>> [ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8 > >>> [ 169.667850] rdma_rxe: qp#39 moved to error state > >>> [ 198.872649] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 198.894829] rdma_rxe: cqe(1)< current # elements in queue (6) > >>> [ 198.981839] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887 > >>> [ 200.332086] rdma_rxe: qp#58 moved to error state > >>> [ 200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d > >>> [ 200.396514] rdma_rxe: qp#61 moved to error state > >>> [ 200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40 > >>> [ 200.417956] rdma_rxe: qp#62 moved to error state > >>> [ 200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24 > >>> [ 200.439654] rdma_rxe: qp#63 moved to error state > >>> [ 200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8 > >>> [ 200.933153] rdma_rxe: qp#70 moved to error state > >>> [ 206.880305] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 206.904030] rdma_rxe: cqe(1)< current # elements in queue (6) > >>> [ 206.991494] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 208.359987] rdma_rxe: 
check_rkey: no MW matches rkey 0x1000e4d > >>> [ 208.360028] rdma_rxe: qp#89 moved to error state > >>> [ 208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136 > >>> [ 208.425675] rdma_rxe: qp#92 moved to error state > >>> [ 208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8 > >>> [ 208.447370] rdma_rxe: qp#93 moved to error state > >>> [ 208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a > >>> [ 208.469550] rdma_rxe: qp#94 moved to error state > >>> [ 208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670 > >>> [ 208.956731] rdma_rxe: qp#100 moved to error state > >>> [ 216.879703] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 216.902199] rdma_rxe: cqe(1)< current # elements in queue (6) > >>> [ 216.989264] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6 > >>> [ 218.363808] rdma_rxe: qp#119 moved to error state > >>> [ 218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4 > >>> [ 218.429513] rdma_rxe: qp#122 moved to error state > >>> [ 218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895 > >>> [ 218.451481] rdma_rxe: qp#123 moved to error state > >>> [ 218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910 > >>> [ 218.473908] rdma_rxe: qp#124 moved to error state > >>> [ 218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b > >>> [ 218.963641] rdma_rxe: qp#130 moved to error state > >>> [ 233.855140] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 233.877202] rdma_rxe: cqe(1)< current # elements in queue (6) > >>> [ 233.963952] rdma_rxe: cqe(32768)> max_cqe(32767) > >>> [ 235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2 > >>> [ 235.305319] rdma_rxe: qp#149 moved to error state > >>> [ 235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8 > >>> [ 235.368838] rdma_rxe: qp#152 moved to error state > >>> [ 235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d > >>> [ 235.390192] rdma_rxe: qp#153 moved to error 
state > >>> [ 235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c > >>> [ 235.411374] rdma_rxe: qp#154 moved to error state > >>> [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482 > >>> [ 235.895828] rdma_rxe: qp#161 moved to error state > >>> " > >>> Not sure if they are problems. > >>> IMO, we should make further investigations. > >>> > >>> Thanks > >>> Zhu Yanjun > >>>> Thanks > > ^ permalink raw reply [flat|nested] 19+ messages in thread
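The rdma_rxe messages quoted above repeat the same few patterns in every cycle. As a rough aid for comparing runs, the sketch below tallies such messages by kind. It is an illustration only: the category labels and the function names `classify` and `summarize` are made up here, and the sample lines are a small subset copied from the quoted log.

```python
import re
from collections import Counter

# A few lines copied from the log quoted in this thread.
SAMPLE = """\
[  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
"""

# (label, pattern) pairs; the labels are hypothetical names for this sketch.
PATTERNS = [
    ("cqe_above_max", re.compile(r"cqe\(\d+\)\s*>\s*max_cqe\(\d+\)")),
    ("cqe_below_queued", re.compile(r"cqe\(\d+\)\s*<\s*current # elements")),
    ("no_mw_for_rkey", re.compile(r"no MW matches rkey")),
    ("no_mr_for_rkey", re.compile(r"no MR matches rkey")),
    ("qp_to_error", re.compile(r"qp#\d+ moved to error state")),
]


def classify(line):
    """Return a category label for an rdma_rxe log line, or None otherwise."""
    if "rdma_rxe:" not in line:
        return None
    for label, pattern in PATTERNS:
        if pattern.search(line):
            return label
    return "other"


def summarize(text):
    """Count rdma_rxe messages per category across a block of log text."""
    labels = (classify(line) for line in text.splitlines())
    return Counter(label for label in labels if label is not None)


if __name__ == "__main__":
    for label, count in sorted(summarize(SAMPLE).items()):
        print(f"{label}: {count}")
```

Feeding it the output of `dmesg` from two kernels would show whether the mix of messages changes between them; it does not decide whether the messages indicate real problems.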
* Re: RXE status in the upstream rping using rxe
  2021-08-18 7:20 ` Zhu Yanjun
@ 2021-08-18 7:44 ` yangx.jy
  2021-08-18 8:28 ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-18 7:44 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On 2021/8/18 15:20, Zhu Yanjun wrote:
> Can you let me know how to reproduce the panic?
>
> 1. linux upstream <----rping----> linux upstream?
rdma_client on v5.13 <---> rdma_server on upstream kernel.

> 2. just run rping?
Running rdma_client on v5.13 and rdma_server on upstream can reproduce
the issue.

Note: running rping can reproduce the issue as well.

> 3. how do you create rxe? with rdma link or rxe_cfg?
rdma link add
> 4. do you make other operations?
No
> 5. other operations?
No
> Thanks.
> Zhu Yanjun
>

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe
  2021-08-18 7:44 ` yangx.jy
@ 2021-08-18 8:28 ` Zhu Yanjun
  2021-08-18 14:33 ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-18 8:28 UTC (permalink / raw)
  To: yangx.jy
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
<yangx.jy@fujitsu.com> wrote:
>
> On 2021/8/18 15:20, Zhu Yanjun wrote:
> > Can you let me know how to reproduce the panic?
> >
> > 1. linux upstream <----rping----> linux upstream?
> rdma_client on v5.13 <---> rdma_server on upstream kernel.
>
> > 2. just run rping?
> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
> the issue.
>
> Note: running rping can reproduce the issue as well.

rping and rdma_server/rdma_client are from the latest rdma-core?

Thanks
Zhu Yanjun

> > 3. how do you create rxe? with rdma link or rxe_cfg?
> rdma link add
> > 4. do you make other operations?
> No
> > 5. other operations?
> No
> > Thanks.
> > Zhu Yanjun
> >

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe
  2021-08-18 8:28 ` Zhu Yanjun
@ 2021-08-18 14:33 ` yangx.jy
  2021-08-20 3:31 ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-18 14:33 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On 2021/8/18 16:28, Zhu Yanjun wrote:
> On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
> <yangx.jy@fujitsu.com> wrote:
>> On 2021/8/18 15:20, Zhu Yanjun wrote:
>>> Can you let me know how to reproduce the panic?
>>>
>>> 1. linux upstream <----rping----> linux upstream?
>> rdma_client on v5.13 <---> rdma_server on upstream kernel.
>>
>>> 2. just run rping?
>> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
>> the issue.
>>
>> Note: running rping can reproduce the issue as well.
> rping and rdma_server/rdma_client are from the latest rdma-core?
Yes, use the latest rdma-core from
https://github.com/linux-rdma/rdma-core (master branch).

> Thanks
> Zhu Yanjun
>
>>> 3. how do you create rxe? with rdma link or rxe_cfg?
>> rdma link add
>>> 4. do you make other operations?
>> No
>>> 5. other operations?
>> No
>>> Thanks.
>>> Zhu Yanjun
>>>

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe
  2021-08-18 14:33 ` yangx.jy
@ 2021-08-20 3:31 ` Zhu Yanjun
  2021-08-20 7:42 ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-20 3:31 UTC (permalink / raw)
  To: yangx.jy
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 18, 2021 at 10:33 PM yangx.jy@fujitsu.com
<yangx.jy@fujitsu.com> wrote:
>
> On 2021/8/18 16:28, Zhu Yanjun wrote:
> > On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
> > <yangx.jy@fujitsu.com> wrote:
> >> On 2021/8/18 15:20, Zhu Yanjun wrote:
> >>> Can you let me know how to reproduce the panic?
> >>>
> >>> 1. linux upstream <----rping----> linux upstream?
> >> rdma_client on v5.13 <---> rdma_server on upstream kernel.
> >>
> >>> 2. just run rping?
> >> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
> >> the issue.
> >>
> >> Note: running rping can reproduce the issue as well.
> > rping and rdma_server/rdma_client are from the latest rdma-core?
> Yes, use the latest rdma-core from
> https://github.com/linux-rdma/rdma-core (master branch).

Latest kernel + latest rdma-core <------rping----> 5.10.y stable + latest rdma-core
Latest kernel + latest rdma-core <------rping----> 5.11.y stable + latest rdma-core
Latest kernel + latest rdma-core <------rping----> 5.12.y stable + latest rdma-core
Latest kernel + latest rdma-core <------rping----> 5.13.y stable + latest rdma-core

The above works well.

Zhu Yanjun

> > Thanks
> > Zhu Yanjun
> >
> >>> 3. how do you create rxe? with rdma link or rxe_cfg?
> >> rdma link add
> >>> 4. do you make other operations?
> >> No
> >>> 5. other operations?
> >> No
> >>> Thanks.
> >>> Zhu Yanjun
> >>>

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe
  2021-08-20 3:31 ` Zhu Yanjun
@ 2021-08-20 7:42 ` yangx.jy
  2021-08-20 21:40 ` Bob Pearson
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-20 7:42 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On 2021/8/20 11:31, Zhu Yanjun wrote:
> Latest kernel + latest rdma-core <------rping----> 5.10.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core <------rping----> 5.11.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core <------rping----> 5.12.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core <------rping----> 5.13.y stable +
> latest rdma-core
>
> The above works well.
Hi Yanjun,

Sorry, I don't know why you cannot reproduce the bug.

Did you see the similar bug reported by Olga Kornievskaia?
https://www.spinics.net/lists/linux-rdma/msg104358.html
https://www.spinics.net/lists/linux-rdma/msg104359.html
https://www.spinics.net/lists/linux-rdma/msg104360.html

Best Regards,
Xiao Yang
> Zhu Yanjun
>

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe
  2021-08-20 7:42 ` yangx.jy
@ 2021-08-20 21:40 ` Bob Pearson
  2021-08-20 22:09 ` Bob Pearson
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Pearson @ 2021-08-20 21:40 UTC (permalink / raw)
  To: yangx.jy, Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Jason Gunthorpe, linux-rdma

On 8/20/21 2:42 AM, yangx.jy@fujitsu.com wrote:
> On 2021/8/20 11:31, Zhu Yanjun wrote:
>> Latest kernel + latest rdma-core <------rping----> 5.10.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core <------rping----> 5.11.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core <------rping----> 5.12.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core <------rping----> 5.13.y stable +
>> latest rdma-core
>>
>> The above works well.
> Hi Yanjun,
>
> Sorry, I don't know why you cannot reproduce the bug.
>
> Did you see the similar bug reported by Olga Kornievskaia?
> https://www.spinics.net/lists/linux-rdma/msg104358.html
> https://www.spinics.net/lists/linux-rdma/msg104359.html
> https://www.spinics.net/lists/linux-rdma/msg104360.html
>
> Best Regards,
> Xiao Yang
>> Zhu Yanjun
>>

There is some interest in the current status of rping on rxe.
I have looked at several configurations and tested the following test cases:

1. The python test suite in rdma-core
2. ib_xxx_bw and ib_xxx_bw -R for RC
3. rping

Between the following node configurations.

A. 5.11.0 (ubuntu 21.04 OOB) + rdma-core 33.1 (ubuntu 21.04 OOB)
B. 5.11.0 + current rdma-core
   + "Provider/rxe:Set the correct value of resid for inline data" (a.k.a rdma-core+)
C. 5.14.0-rc1+ (for-next current)
   + 5 recent bug fixes (a.k.a. for-next+)
       RDMA/rxe:Fix bug in get srq wqe in rxe_resp.c.patch
       RDMA/rxe:Fix bug in rxe_net.c.patch
       RDMA/rxe:Add memory barriers to kernel queues.patch
       RDMA/rxe:Fix memory allocation while locked.patch
       RDMA/rxe:Zero out index member of struct rxe_queue.patch
   + rdma-core+
D. for-next+ + rdma-core (33.1)

Results:
1. A N/A
1. B no errors, some skips
1. C no errors, some skips
1. D N/A
   (n.b. requires adding IPV6 address == gid[0] by hand)

2. [A-D] -> [A-D] all pass

3. A -> A, C -> C, D -> D all pass, all other combinations fail

(RDMA_resolve_route: No such device. Not yet sure cause of failures but looking into it.)
In theory these should all work but rdmacm is more sensitive to configuration than verbs.

Bob

^ permalink raw reply [flat|nested] 19+ messages in thread
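The rping result in item 3 of the message above is easier to scan as a full matrix. The sketch below only restates what that message reports; it does not add any measurement. Reading "X -> Y" as client on X and server on Y is an assumption, and the short configuration descriptions are abbreviations of the A to D definitions given there.

```python
from itertools import product

# Node configurations, abbreviated from the A-D list in the message.
CONFIGS = {
    "A": "5.11.0 + rdma-core 33.1",
    "B": "5.11.0 + rdma-core+",
    "C": "for-next+ + rdma-core+",
    "D": "for-next+ + rdma-core 33.1",
}

# Reported rping result: only these pairs passed, every other pair failed.
RPING_PASS = {("A", "A"), ("C", "C"), ("D", "D")}


def rping_matrix():
    """Map every (from, to) configuration pair to True (pass) or False (fail)."""
    return {pair: pair in RPING_PASS for pair in product(CONFIGS, repeat=2)}


def format_matrix(matrix):
    """Render the matrix as plain text, rows = from, columns = to."""
    labels = list(CONFIGS)
    lines = ["     " + " ".join(f"{label:>4}" for label in labels)]
    for src in labels:
        cells = []
        for dst in labels:
            cell = "pass" if matrix[(src, dst)] else "FAIL"
            cells.append(f"{cell:>4}")
        lines.append(f"{src} -> " + " ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_matrix(rping_matrix()))
```

Laid out this way, the reported result has three passing pairs on the diagonal and thirteen failing pairs, including B -> B.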
* Re: RXE status in the upstream rping using rxe
  2021-08-20 21:40 ` Bob Pearson
@ 2021-08-20 22:09 ` Bob Pearson
  0 siblings, 0 replies; 19+ messages in thread
From: Bob Pearson @ 2021-08-20 22:09 UTC (permalink / raw)
  To: yangx.jy, Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Jason Gunthorpe, linux-rdma

On 8/20/21 4:40 PM, Bob Pearson wrote:
> On 8/20/21 2:42 AM, yangx.jy@fujitsu.com wrote:
>> On 2021/8/20 11:31, Zhu Yanjun wrote:
>>> Latest kernel + latest rdma-core <------rping----> 5.10.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core <------rping----> 5.11.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core <------rping----> 5.12.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core <------rping----> 5.13.y stable +
>>> latest rdma-core
>>>
>>> The above works well.
>> Hi Yanjun,
>>
>> Sorry, I don't know why you cannot reproduce the bug.
>>
>> Did you see the similar bug reported by Olga Kornievskaia?
>> https://www.spinics.net/lists/linux-rdma/msg104358.html
>> https://www.spinics.net/lists/linux-rdma/msg104359.html
>> https://www.spinics.net/lists/linux-rdma/msg104360.html
>>
>> Best Regards,
>> Xiao Yang
>>> Zhu Yanjun
>>>
>
> There is some interest in the current status of rping on rxe.
> I have looked at several configurations and tested the following test cases:
>
> 1. The python test suite in rdma-core
> 2. ib_xxx_bw and ib_xxx_bw -R for RC
> 3. rping
>
> Between the following node configurations.
>
> A. 5.11.0 (ubuntu 21.04 OOB) + rdma-core 33.1 (ubuntu 21.04 OOB)
> B. 5.11.0 + current rdma-core
>    + "Provider/rxe:Set the correct value of resid for inline data" (a.k.a rdma-core+)
> C. 5.14.0-rc1+ (for-next current)
>    + 5 recent bug fixes (a.k.a. for-next+)
>        RDMA/rxe:Fix bug in get srq wqe in rxe_resp.c.patch
>        RDMA/rxe:Fix bug in rxe_net.c.patch
>        RDMA/rxe:Add memory barriers to kernel queues.patch
>        RDMA/rxe:Fix memory allocation while locked.patch
>        RDMA/rxe:Zero out index member of struct rxe_queue.patch
>    + rdma-core+
> D. for-next+ + rdma-core (33.1)
>
> Results:
> 1. A N/A
> 1. B no errors, some skips
> 1. C no errors, some skips
> 1. D N/A
>    (n.b. requires adding IPV6 address == gid[0] by hand)
>
> 2. [A-D] -> [A-D] all pass
>
> 3. A -> A, C -> C, D -> D all pass, all other combinations fail
>
> (RDMA_resolve_route: No such device. Not yet sure cause of failures but looking into it.)
> In theory these should all work but rdmacm is more sensitive to configuration than verbs.
>
> Bob
>

Found the problem (thank you google)

If you run both

	server$ rping -s -a nn.nn.nn.nn
	client$ rping -c -a nn.nn.nn.nn

now all tests pass for rping as well.

Bob

^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe 2021-08-04 9:05 ` Zhu Yanjun 2021-08-06 2:37 ` Olga Kornievskaia @ 2021-08-13 21:53 ` Bob Pearson 2021-08-14 5:32 ` Leon Romanovsky 1 sibling, 1 reply; 19+ messages in thread From: Bob Pearson @ 2021-08-13 21:53 UTC (permalink / raw) To: Zhu Yanjun, Leon Romanovsky Cc: Olga Kornievskaia, Jason Gunthorpe, linux-rdma On 8/4/21 4:05 AM, Zhu Yanjun wrote: > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote: >> >> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote: >>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote: >>>> >>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote: >>>>> >>>>> Hi, >>>>> >>>>> Can you please help me to understand the RXE status in the upstream? >>>>> >>>>> Does we still have crashes/interop issues/e.t.c? >>>> >>>> I made some developments with the RXE in the upstream, from my usage >>>> with latest RXE, >>>> I found the following: >>>> >>>> 1. rdma-core can not work well with latest RDMA git; >>> >>> The latest RDMA git is >>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git >> >> "Latest" is a relative term, what SHA did you test? >> Let's focus on fixing RXE before we will continue with new features. > > Thanks a lot. I agree with you. 
> > rdma-core: > 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull > request #1038 from selvintxavier/master > 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr > 327d45e0 tests: Add missing MAC element to args list > 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices > 8754fb51 bnxt_re/lib: Use separate indices for shadow queue > be4d8abf bnxt_re/lib: add a function to initialize software queue > > kernel rdma: > 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD) > RDMA/qedr: Improve error logs for rdma_alloc_tid error return > 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc > 991c4274dc17 RDMA/hfi1: Fix typo in comments > 8d7e415d5561 docs: Fix infiniband uverbs minor number > bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure > that requests are valid > bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting > e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails > a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on > hfi1_devdata->user_refcount > > with the above kernel and rdma-core, the following messages will appear. 
> " > [ 54.214608] rdma_rxe: loaded > [ 54.217089] infiniband rxe0: set active > [ 54.217101] infiniband rxe0: added enp0s8 > [ 167.623200] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 167.645590] rdma_rxe: cqe(1) < current # elements in queue (6) > [ 167.733297] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247 > [ 169.074796] rdma_rxe: qp#27 moved to error state > [ 169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de > [ 169.138889] rdma_rxe: qp#30 moved to error state > [ 169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7 > [ 169.160601] rdma_rxe: qp#31 moved to error state > [ 169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782 > [ 169.182170] rdma_rxe: qp#32 moved to error state > [ 169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8 > [ 169.667850] rdma_rxe: qp#39 moved to error state > [ 198.872649] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 198.894829] rdma_rxe: cqe(1) < current # elements in queue (6) > [ 198.981839] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887 > [ 200.332086] rdma_rxe: qp#58 moved to error state > [ 200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d > [ 200.396514] rdma_rxe: qp#61 moved to error state > [ 200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40 > [ 200.417956] rdma_rxe: qp#62 moved to error state > [ 200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24 > [ 200.439654] rdma_rxe: qp#63 moved to error state > [ 200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8 > [ 200.933153] rdma_rxe: qp#70 moved to error state > [ 206.880305] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 206.904030] rdma_rxe: cqe(1) < current # elements in queue (6) > [ 206.991494] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d > [ 208.360028] rdma_rxe: qp#89 moved to error state > [ 208.425637] rdma_rxe: 
check_rkey: no MW matches rkey 0x1001136 > [ 208.425675] rdma_rxe: qp#92 moved to error state > [ 208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8 > [ 208.447370] rdma_rxe: qp#93 moved to error state > [ 208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a > [ 208.469550] rdma_rxe: qp#94 moved to error state > [ 208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670 > [ 208.956731] rdma_rxe: qp#100 moved to error state > [ 216.879703] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 216.902199] rdma_rxe: cqe(1) < current # elements in queue (6) > [ 216.989264] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6 > [ 218.363808] rdma_rxe: qp#119 moved to error state > [ 218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4 > [ 218.429513] rdma_rxe: qp#122 moved to error state > [ 218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895 > [ 218.451481] rdma_rxe: qp#123 moved to error state > [ 218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910 > [ 218.473908] rdma_rxe: qp#124 moved to error state > [ 218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b > [ 218.963641] rdma_rxe: qp#130 moved to error state > [ 233.855140] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 233.877202] rdma_rxe: cqe(1) < current # elements in queue (6) > [ 233.963952] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2 > [ 235.305319] rdma_rxe: qp#149 moved to error state > [ 235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8 > [ 235.368838] rdma_rxe: qp#152 moved to error state > [ 235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d > [ 235.390192] rdma_rxe: qp#153 moved to error state > [ 235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c > [ 235.411374] rdma_rxe: qp#154 moved to error state > [ 235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482 > [ 235.895828] rdma_rxe: qp#161 moved to error 
state > " > Not sure if they are problems. > IMO, we should make further investigations. > > Thanks > Zhu Yanjun >> >> Thanks All of the messages are from the rxe driver caused by the python tests intentionally causing errors. Here is a test run with messages. No errors occurred. This is run on current rdma_core and for_next. Does not answer the question about rping. That needs more testing. (so ru is short for "./build/bin/run_tests.py --dev rxe_1") Bob rpearson:rdma-core$ sudo dmesg -C rpearson:rdma-core$ so ru .............sssssssss.............sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........sssssssssssssssssss....ssss........s...s.s..s..........ssssssssss..ss ---------------------------------------------------------------------- Ran 199 tests in 0.418s OK (skipped=134) rpearson:rdma-core$ sudo dmesg [ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767) [ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6) [ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767) [ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256 [ 9396.273120] rdma_rxe: qp#27 moved to error state [ 9396.283112] rdma_rxe: check_rkey: no MW matches rkey 0x10005be [ 9396.283116] rdma_rxe: qp#30 moved to error state [ 9396.286497] rdma_rxe: check_rkey: no MW matches rkey 0x100063d [ 9396.286501] rdma_rxe: qp#31 moved to error state [ 9396.289917] rdma_rxe: check_rkey: no MW matches rkey 0x10007a6 [ 9396.289922] rdma_rxe: qp#32 moved to error state [ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868 [ 9396.364854] rdma_rxe: qp#37 moved to error state ^ permalink raw reply [flat|nested] 19+ messages in thread
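One way to check the claim that every rdma_rxe message in the log comes from a negative test is to match each dmesg line against the message shapes seen in this thread and flag anything else. The following is only a rough sketch: the pattern names and the `classify` helper are hypothetical, and the patterns are copied from the excerpts quoted here rather than from the rxe driver sources.

```python
import re
from collections import Counter

# Hypothetical pattern table, built only from the dmesg excerpts quoted in
# this thread. It is not an authoritative list of rxe driver messages.
EXPECTED_PATTERNS = {
    "cqe_too_large": re.compile(r"rdma_rxe: cqe\(\d+\) > max_cqe\(\d+\)"),
    "cqe_too_small": re.compile(
        r"rdma_rxe: cqe\(\d+\) < current # elements in queue \(\d+\)"
    ),
    "bad_rkey": re.compile(
        r"rdma_rxe: check_rkey: no M[RW] matches rkey 0x[0-9a-f]+"
    ),
    "qp_error": re.compile(r"rdma_rxe: qp#\d+ moved to error state"),
}


def classify(lines):
    """Count rdma_rxe lines per known pattern and collect the rest."""
    counts = Counter()
    unexpected = []
    for line in lines:
        if "rdma_rxe:" not in line:
            continue  # not an rxe driver message
        for name, pattern in EXPECTED_PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
        else:
            # An rxe message that matches none of the known shapes.
            unexpected.append(line)
    return counts, unexpected


# Sample lines taken from the dmesg output quoted in the message above.
SAMPLE = """\
[ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)
[ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)
[ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)
[ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256
[ 9396.273120] rdma_rxe: qp#27 moved to error state
[ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868
[ 9396.364854] rdma_rxe: qp#37 moved to error state
""".splitlines()

counts, unexpected = classify(SAMPLE)
print(dict(counts), unexpected)
```

An empty `unexpected` list only means that no unfamiliar rxe message appeared. It says nothing about rping, which still needs separate testing.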
* Re: RXE status in the upstream rping using rxe 2021-08-13 21:53 ` Bob Pearson @ 2021-08-14 5:32 ` Leon Romanovsky 0 siblings, 0 replies; 19+ messages in thread From: Leon Romanovsky @ 2021-08-14 5:32 UTC (permalink / raw) To: Bob Pearson; +Cc: Zhu Yanjun, Olga Kornievskaia, Jason Gunthorpe, linux-rdma On Fri, Aug 13, 2021 at 04:53:56PM -0500, Bob Pearson wrote: > On 8/4/21 4:05 AM, Zhu Yanjun wrote: > > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote: > >> > >> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote: > >>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote: > >>>> > >>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote: > >>>>> > >>>>> Hi, > >>>>> > >>>>> Can you please help me to understand the RXE status in the upstream? > >>>>> > >>>>> Does we still have crashes/interop issues/e.t.c? > >>>> > >>>> I made some developments with the RXE in the upstream, from my usage > >>>> with latest RXE, > >>>> I found the following: > >>>> > >>>> 1. rdma-core can not work well with latest RDMA git; > >>> > >>> The latest RDMA git is > >>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git > >> > >> "Latest" is a relative term, what SHA did you test? > >> Let's focus on fixing RXE before we will continue with new features. > > > > Thanks a lot. I agree with you. 
> > [...] > > with the above kernel and rdma-core, the following messages will appear. > > [...] > > Not sure if they are problems. > > IMO, we should make further investigations. > > All of the messages are from the rxe driver caused by the python tests intentionally causing errors. Here is a test run with messages. No errors occurred. This is run on current rdma_core and for_next. Does not answer the question about rping. That needs more testing. > (so ru is short for "./build/bin/run_tests.py --dev rxe_1") > > Bob > > [...] > rpearson:rdma-core$ sudo dmesg > [ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767) > [ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6) > [...] > [ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868 > [ 9396.364854] rdma_rxe: qp#37 moved to error state You shouldn't print these errors by default; they need to be at *_dbg() level. Thanks ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: RXE status in the upstream rping using rxe 2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky 2021-08-04 1:01 ` Zhu Yanjun @ 2021-08-23 7:53 ` Zhu Yanjun 1 sibling, 0 replies; 19+ messages in thread From: Zhu Yanjun @ 2021-08-23 7:53 UTC (permalink / raw) To: Leon Romanovsky Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote: > > Hi, > > Can you please help me to understand the RXE status in the upstream? Hi all, On Ubuntu 20.04 (kernel 5.4.0-80) with the latest rdma-core: " root@xxx:~/rdma-core# cat /etc/issue Ubuntu 20.04.2 LTS \n \l root@xxx:~/rdma-core# uname -a Linux 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux root@xxx:~/rdma-core# git log -1 --oneline 206a0cfd (HEAD -> master, origin/master, origin/HEAD) Merge pull request #1047 from yishaih/mlx5_misc " Running run_tests.py, I got the following errors. I am not sure whether they indicate a problem; please comment. They are easy to reproduce on Ubuntu 20.04 + 5.4.0-80.
" .............sssssssss..FFF........sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.ssssssssssssssssssssssssss....s...ss..........s....s..s.......ssReceived the following exceptions: {'active': BrokenBarrierError()} EReceived the following exceptions: {'active': BrokenBarrierError()} E........ss ====================================================================== ERROR: test_rdmacm_async_ex_multicast_traffic (tests.test_rdmacm.CMTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/rdma-core/tests/utils.py", line 976, in inner return func(instance) File "/root/rdma-core/tests/test_rdmacm.py", line 42, in test_rdmacm_async_ex_multicast_traffic self.two_nodes_rdmacm_traffic(CMAsyncConnection, File "/root/rdma-core/tests/base.py", line 368, in two_nodes_rdmacm_traffic raise(res) threading.BrokenBarrierError ====================================================================== ERROR: test_rdmacm_async_multicast_traffic (tests.test_rdmacm.CMTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/rdma-core/tests/utils.py", line 976, in inner return func(instance) File "/root/rdma-core/tests/test_rdmacm.py", line 36, in test_rdmacm_async_multicast_traffic self.two_nodes_rdmacm_traffic(CMAsyncConnection, File "/root/rdma-core/tests/base.py", line 368, in two_nodes_rdmacm_traffic raise(res) threading.BrokenBarrierError ====================================================================== FAIL: test_phys_port_cnt_ex (tests.test_device.DeviceTest) Test phys_port_cnt_ex ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/rdma-core/tests/test_device.py", line 222, in test_phys_port_cnt_ex self.assertEqual(phys_port_cnt, phys_port_cnt_ex, AssertionError: 1 != 0 : phys_port_cnt_ex and phys_port_cnt should be equal if 
number of ports is less than 256 ====================================================================== FAIL: test_query_device (tests.test_device.DeviceTest) Test ibv_query_device() ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/rdma-core/tests/test_device.py", line 63, in test_query_device self.verify_device_attr(attr, dev) File "/root/rdma-core/tests/test_device.py", line 187, in verify_device_attr assert attr.vendor_id != 0 AssertionError ====================================================================== FAIL: test_query_device_ex (tests.test_device.DeviceTest) Test ibv_query_device_ex() ---------------------------------------------------------------------- Traceback (most recent call last): File "/root/rdma-core/tests/test_device.py", line 206, in test_query_device_ex self.verify_device_attr(attr_ex.orig_attr, dev) File "/root/rdma-core/tests/test_device.py", line 187, in verify_device_attr assert attr.vendor_id != 0 AssertionError ---------------------------------------------------------------------- Ran 205 tests in 40.112s FAILED (failures=3, errors=2, skipped=137) Traceback (most recent call last): File "device.pyx", line 170, in pyverbs.device.Context.close AttributeError: 'NoneType' object has no attribute 'debug' Exception ignored in: 'pyverbs.device.Context.__dealloc__' Traceback (most recent call last): File "device.pyx", line 170, in pyverbs.device.Context.close AttributeError: 'NoneType' object has no attribute 'debug' " > > Does we still have crashes/interop issues/e.t.c? > > Latest commit is: > 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations") > > Thanks ^ permalink raw reply [flat|nested] 19+ messages in thread
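For readers unfamiliar with the `BrokenBarrierError` in the two rdmacm errors above: in Python it is raised by `threading.Barrier.wait()` when the barrier breaks, for example because the peer never arrives before the timeout. The sketch below is not the rdma-core test code. The `run_pair` helper, the 'active'/'passive' roles and the timeout value are assumptions chosen to mirror the shape of the reported output.

```python
import threading


def run_pair(passive_fails, timeout=0.5):
    """Run two threads that meet at a barrier; return any caught exceptions.

    Hypothetical illustration only: the passive side stands in for a peer
    that fails before it reaches the synchronisation point.
    """
    barrier = threading.Barrier(2, timeout=timeout)
    exceptions = {}

    def passive():
        if passive_fails:
            return  # peer gives up early and never reaches the barrier
        barrier.wait()

    def active():
        try:
            barrier.wait()
        except threading.BrokenBarrierError as exc:
            # The wait timed out, which put the barrier in the broken state.
            exceptions["active"] = exc

    threads = [threading.Thread(target=passive),
               threading.Thread(target=active)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return exceptions


print("Received the following exceptions:", run_pair(passive_fails=True))
```

If the real tests behave similarly, the `BrokenBarrierError` on the active side would be a symptom, and the underlying failure would be on the other side of the connection.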