* RXE status in the upstream rping using rxe
@ 2021-08-03 18:07 Leon Romanovsky
  2021-08-04  1:01 ` Zhu Yanjun
  2021-08-23  7:53 ` Zhu Yanjun
  0 siblings, 2 replies; 19+ messages in thread
From: Leon Romanovsky @ 2021-08-03 18:07 UTC (permalink / raw)
  To: Olga Kornievskaia, Bob Pearson, Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma

Hi,

Can you please help me understand the current status of RXE in the upstream kernel?

Do we still have crashes, interop issues, etc.?

Latest commit is:
20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")

Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky
@ 2021-08-04  1:01 ` Zhu Yanjun
  2021-08-04  1:09   ` Zhu Yanjun
  2021-08-23  7:53 ` Zhu Yanjun
  1 sibling, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-04  1:01 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> Hi,
>
> Can you please help me to understand the RXE status in the upstream?
>
> Does we still have crashes/interop issues/e.t.c?

I have done some development work on the upstream RXE, and from my usage
of the latest RXE I found the following:

1. rdma-core does not work well with the latest RDMA git;
2. There are some problems with rxe across different kernel versions.

For 2, the commit
https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/
relieves this problem, but I am not sure whether other bugs still exist in RXE.

For 1, I will continue to investigate.
So we should take the time to run tests, fix bugs and make rxe stable.
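
As a rough sketch, that combination can be exercised like this (the device
name, netdev and paths below are only examples):

  # build the latest rdma-core from git (build.sh is rdma-core's in-tree build script)
  git clone https://github.com/linux-rdma/rdma-core.git
  cd rdma-core && ./build.sh

  # attach a soft-RoCE (rxe) device to an ordinary netdev (names are examples)
  sudo rdma link add rxe0 type rxe netdev enp0s8

  # run the pyverbs test suite from the rdma-core build against that device
  sudo ./build/bin/run_tests.py --dev rxe0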

Zhu Yanjun
>
> Latest commit is:
> 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")
>
> Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-04  1:01 ` Zhu Yanjun
@ 2021-08-04  1:09   ` Zhu Yanjun
  2021-08-04  5:41     ` Leon Romanovsky
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-04  1:09 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > Hi,
> >
> > Can you please help me to understand the RXE status in the upstream?
> >
> > Does we still have crashes/interop issues/e.t.c?
>
> I made some developments with the RXE in the upstream, from my usage
> with latest RXE,
> I found the following:
>
> 1. rdma-core can not work well with latest RDMA git;

The latest RDMA git is
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

Zhu Yanjun
> 2. There are some problems with rxe in different kernel version;
>
> To 2, the commit
> https://patchwork.kernel.org/project/linux-rdma/patch/20210729220039.18549-3-rpearsonhpe@gmail.com/
> can relieve this problem,
> not sure if some bugs still exist in RXE.
>
> To 1, I will continue to make further investigations.
> So we should have time to make tests, fix bugs and make rxe stable.
>
> Zhu Yanjun
> >
> > Latest commit is:
> > 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")
> >
> > Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-04  1:09   ` Zhu Yanjun
@ 2021-08-04  5:41     ` Leon Romanovsky
  2021-08-04  9:05       ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: Leon Romanovsky @ 2021-08-04  5:41 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > Hi,
> > >
> > > Can you please help me to understand the RXE status in the upstream?
> > >
> > > Does we still have crashes/interop issues/e.t.c?
> >
> > I made some developments with the RXE in the upstream, from my usage
> > with latest RXE,
> > I found the following:
> >
> > 1. rdma-core can not work well with latest RDMA git;
> 
> The latest RDMA git is
> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git

"Latest" is a relative term, what SHA did you test?
Let's focus on fixing RXE before we will continue with new features.
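
For example, the output of something like this would tell me exactly what
was tested (the checkout paths are only examples):

  git -C ~/src/linux-rdma log -1 --oneline   # kernel rdma.git checkout (path is an example)
  git -C ~/src/rdma-core log -1 --oneline    # rdma-core checkout (path is an example)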

Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-04  5:41     ` Leon Romanovsky
@ 2021-08-04  9:05       ` Zhu Yanjun
  2021-08-06  2:37         ` Olga Kornievskaia
  2021-08-13 21:53         ` Bob Pearson
  0 siblings, 2 replies; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-04  9:05 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > >
> > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > >
> > > > Hi,
> > > >
> > > > Can you please help me to understand the RXE status in the upstream?
> > > >
> > > > Does we still have crashes/interop issues/e.t.c?
> > >
> > > I made some developments with the RXE in the upstream, from my usage
> > > with latest RXE,
> > > I found the following:
> > >
> > > 1. rdma-core can not work well with latest RDMA git;
> >
> > The latest RDMA git is
> > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>
> "Latest" is a relative term, what SHA did you test?
> Let's focus on fixing RXE before we will continue with new features.

Thanks a lot. I agree with you.

rdma-core:
313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
request #1038 from selvintxavier/master
2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
327d45e0 tests: Add missing MAC element to args list
66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
8754fb51 bnxt_re/lib: Use separate indices for shadow queue
be4d8abf bnxt_re/lib: add a function to initialize software queue

kernel rdma:
0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
RDMA/qedr: Improve error logs for rdma_alloc_tid error return
090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
991c4274dc17 RDMA/hfi1: Fix typo in comments
8d7e415d5561 docs: Fix infiniband uverbs minor number
bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
that requests are valid
bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
hfi1_devdata->user_refcount

With the above kernel and rdma-core, the following messages appear:
"
[   54.214608] rdma_rxe: loaded
[   54.217089] infiniband rxe0: set active
[   54.217101] infiniband rxe0: added enp0s8
[  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
[  169.138889] rdma_rxe: qp#30 moved to error state
[  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
[  169.160601] rdma_rxe: qp#31 moved to error state
[  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
[  169.182170] rdma_rxe: qp#32 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
[  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
[  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
[  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
[  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
[  200.332086] rdma_rxe: qp#58 moved to error state
[  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
[  200.396514] rdma_rxe: qp#61 moved to error state
[  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
[  200.417956] rdma_rxe: qp#62 moved to error state
[  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
[  200.439654] rdma_rxe: qp#63 moved to error state
[  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
[  200.933153] rdma_rxe: qp#70 moved to error state
[  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
[  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
[  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
[  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
[  208.360028] rdma_rxe: qp#89 moved to error state
[  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
[  208.425675] rdma_rxe: qp#92 moved to error state
[  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
[  208.447370] rdma_rxe: qp#93 moved to error state
[  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
[  208.469550] rdma_rxe: qp#94 moved to error state
[  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
[  208.956731] rdma_rxe: qp#100 moved to error state
[  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
[  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
[  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
[  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
[  218.363808] rdma_rxe: qp#119 moved to error state
[  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
[  218.429513] rdma_rxe: qp#122 moved to error state
[  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
[  218.451481] rdma_rxe: qp#123 moved to error state
[  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
[  218.473908] rdma_rxe: qp#124 moved to error state
[  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
[  218.963641] rdma_rxe: qp#130 moved to error state
[  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
[  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
[  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
[  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
[  235.305319] rdma_rxe: qp#149 moved to error state
[  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
[  235.368838] rdma_rxe: qp#152 moved to error state
[  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
[  235.390192] rdma_rxe: qp#153 moved to error state
[  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
[  235.411374] rdma_rxe: qp#154 moved to error state
[  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
[  235.895828] rdma_rxe: qp#161 moved to error state
"
I am not sure whether these messages indicate real problems.
IMO, we should investigate further.
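
A simple way to correlate the messages with the test run would be something
like this (the device name is an example):

  sudo dmesg -C                              # clear the kernel log first
  sudo ./build/bin/run_tests.py --dev rxe0   # run the pyverbs suite against the rxe device
  sudo dmesg                                 # see which messages this run produced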

Thanks
Zhu Yanjun
>
> Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-04  9:05       ` Zhu Yanjun
@ 2021-08-06  2:37         ` Olga Kornievskaia
  2021-08-17  2:28           ` Zhu Yanjun
  2021-08-13 21:53         ` Bob Pearson
  1 sibling, 1 reply; 19+ messages in thread
From: Olga Kornievskaia @ 2021-08-06  2:37 UTC (permalink / raw)
  To: Zhu Yanjun; +Cc: Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>
> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
> >
> > On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > > >
> > > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Can you please help me to understand the RXE status in the upstream?
> > > > >
> > > > > Does we still have crashes/interop issues/e.t.c?
> > > >
> > > > I made some developments with the RXE in the upstream, from my usage
> > > > with latest RXE,
> > > > I found the following:
> > > >
> > > > 1. rdma-core can not work well with latest RDMA git;
> > >
> > > The latest RDMA git is
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >
> > "Latest" is a relative term, what SHA did you test?
> > Let's focus on fixing RXE before we will continue with new features.
>
> Thanks a lot. I agree with you.

I believe simple rping still doesn't work linux-to-linux. The last
working version (of rping in rxe) was 5.13 I think. I have posted a
number of crashes rping encounters (gotta get that working before I
can even try NFSoRDMA).

Thank you for working on the code.

We (NFS community) do test NFSoRDMA every git pull using rxe and siw
but lately have been encountering problems.

> rdma-core:
> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> request #1038 from selvintxavier/master
> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> 327d45e0 tests: Add missing MAC element to args list
> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> be4d8abf bnxt_re/lib: add a function to initialize software queue
>
> kernel rdma:
> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> 991c4274dc17 RDMA/hfi1: Fix typo in comments
> 8d7e415d5561 docs: Fix infiniband uverbs minor number
> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> that requests are valid
> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> hfi1_devdata->user_refcount
>
> with the above kernel and rdma-core, the following messages will appear.
> "
> [   54.214608] rdma_rxe: loaded
> [   54.217089] infiniband rxe0: set active
> [   54.217101] infiniband rxe0: added enp0s8
> [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> [  169.074796] rdma_rxe: qp#27 moved to error state
> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> [  169.138889] rdma_rxe: qp#30 moved to error state
> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> [  169.160601] rdma_rxe: qp#31 moved to error state
> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> [  169.182170] rdma_rxe: qp#32 moved to error state
> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> [  169.667850] rdma_rxe: qp#39 moved to error state
> [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> [  200.332086] rdma_rxe: qp#58 moved to error state
> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> [  200.396514] rdma_rxe: qp#61 moved to error state
> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> [  200.417956] rdma_rxe: qp#62 moved to error state
> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> [  200.439654] rdma_rxe: qp#63 moved to error state
> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> [  200.933153] rdma_rxe: qp#70 moved to error state
> [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> [  208.360028] rdma_rxe: qp#89 moved to error state
> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> [  208.425675] rdma_rxe: qp#92 moved to error state
> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> [  208.447370] rdma_rxe: qp#93 moved to error state
> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> [  208.469550] rdma_rxe: qp#94 moved to error state
> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> [  208.956731] rdma_rxe: qp#100 moved to error state
> [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> [  218.363808] rdma_rxe: qp#119 moved to error state
> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> [  218.429513] rdma_rxe: qp#122 moved to error state
> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> [  218.451481] rdma_rxe: qp#123 moved to error state
> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> [  218.473908] rdma_rxe: qp#124 moved to error state
> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> [  218.963641] rdma_rxe: qp#130 moved to error state
> [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> [  235.305319] rdma_rxe: qp#149 moved to error state
> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> [  235.368838] rdma_rxe: qp#152 moved to error state
> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> [  235.390192] rdma_rxe: qp#153 moved to error state
> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> [  235.411374] rdma_rxe: qp#154 moved to error state
> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> [  235.895828] rdma_rxe: qp#161 moved to error state
> "
> Not sure if they are problems.
> IMO, we should make further investigations.
>
> Thanks
> Zhu Yanjun
> >
> > Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-04  9:05       ` Zhu Yanjun
  2021-08-06  2:37         ` Olga Kornievskaia
@ 2021-08-13 21:53         ` Bob Pearson
  2021-08-14  5:32           ` Leon Romanovsky
  1 sibling, 1 reply; 19+ messages in thread
From: Bob Pearson @ 2021-08-13 21:53 UTC (permalink / raw)
  To: Zhu Yanjun, Leon Romanovsky
  Cc: Olga Kornievskaia, Jason Gunthorpe, linux-rdma

On 8/4/21 4:05 AM, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>>
>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>
>>>>> Does we still have crashes/interop issues/e.t.c?
>>>>
>>>> I made some developments with the RXE in the upstream, from my usage
>>>> with latest RXE,
>>>> I found the following:
>>>>
>>>> 1. rdma-core can not work well with latest RDMA git;
>>>
>>> The latest RDMA git is
>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>
>> "Latest" is a relative term, what SHA did you test?
>> Let's focus on fixing RXE before we will continue with new features.
> 
> Thanks a lot. I agree with you.
> 
> rdma-core:
> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> request #1038 from selvintxavier/master
> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> 327d45e0 tests: Add missing MAC element to args list
> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> be4d8abf bnxt_re/lib: add a function to initialize software queue
> 
> kernel rdma:
> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> 991c4274dc17 RDMA/hfi1: Fix typo in comments
> 8d7e415d5561 docs: Fix infiniband uverbs minor number
> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> that requests are valid
> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> hfi1_devdata->user_refcount
> 
> with the above kernel and rdma-core, the following messages will appear.
> "
> [   54.214608] rdma_rxe: loaded
> [   54.217089] infiniband rxe0: set active
> [   54.217101] infiniband rxe0: added enp0s8
> [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> [  169.074796] rdma_rxe: qp#27 moved to error state
> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> [  169.138889] rdma_rxe: qp#30 moved to error state
> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> [  169.160601] rdma_rxe: qp#31 moved to error state
> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> [  169.182170] rdma_rxe: qp#32 moved to error state
> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> [  169.667850] rdma_rxe: qp#39 moved to error state
> [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> [  200.332086] rdma_rxe: qp#58 moved to error state
> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> [  200.396514] rdma_rxe: qp#61 moved to error state
> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> [  200.417956] rdma_rxe: qp#62 moved to error state
> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> [  200.439654] rdma_rxe: qp#63 moved to error state
> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> [  200.933153] rdma_rxe: qp#70 moved to error state
> [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> [  208.360028] rdma_rxe: qp#89 moved to error state
> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> [  208.425675] rdma_rxe: qp#92 moved to error state
> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> [  208.447370] rdma_rxe: qp#93 moved to error state
> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> [  208.469550] rdma_rxe: qp#94 moved to error state
> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> [  208.956731] rdma_rxe: qp#100 moved to error state
> [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> [  218.363808] rdma_rxe: qp#119 moved to error state
> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> [  218.429513] rdma_rxe: qp#122 moved to error state
> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> [  218.451481] rdma_rxe: qp#123 moved to error state
> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> [  218.473908] rdma_rxe: qp#124 moved to error state
> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> [  218.963641] rdma_rxe: qp#130 moved to error state
> [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> [  235.305319] rdma_rxe: qp#149 moved to error state
> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> [  235.368838] rdma_rxe: qp#152 moved to error state
> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> [  235.390192] rdma_rxe: qp#153 moved to error state
> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> [  235.411374] rdma_rxe: qp#154 moved to error state
> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> [  235.895828] rdma_rxe: qp#161 moved to error state
> "
> Not sure if they are problems.
> IMO, we should make further investigations.
> 
> Thanks
> Zhu Yanjun
>>
>> Thanks



All of the messages are from the rxe driver and are caused by the python tests intentionally
triggering errors. Here is a test run with the resulting messages; no errors occurred. This was
run on current rdma-core and for-next. It does not answer the question about rping; that needs
more testing. (so ru is short for "./build/bin/run_tests.py --dev rxe_1")

Bob

rpearson:rdma-core$ sudo dmesg -C

rpearson:rdma-core$ so ru

.............sssssssss.............sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........sssssssssssssssssss....ssss........s...s.s..s..........ssssssssss..ss

----------------------------------------------------------------------

Ran 199 tests in 0.418s



OK (skipped=134)

rpearson:rdma-core$ sudo dmesg

[ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)

[ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)

[ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)

[ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256

[ 9396.273120] rdma_rxe: qp#27 moved to error state

[ 9396.283112] rdma_rxe: check_rkey: no MW matches rkey 0x10005be

[ 9396.283116] rdma_rxe: qp#30 moved to error state

[ 9396.286497] rdma_rxe: check_rkey: no MW matches rkey 0x100063d

[ 9396.286501] rdma_rxe: qp#31 moved to error state

[ 9396.289917] rdma_rxe: check_rkey: no MW matches rkey 0x10007a6

[ 9396.289922] rdma_rxe: qp#32 moved to error state

[ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868

[ 9396.364854] rdma_rxe: qp#37 moved to error state






* Re: RXE status in the upstream rping using rxe
  2021-08-13 21:53         ` Bob Pearson
@ 2021-08-14  5:32           ` Leon Romanovsky
  0 siblings, 0 replies; 19+ messages in thread
From: Leon Romanovsky @ 2021-08-14  5:32 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Zhu Yanjun, Olga Kornievskaia, Jason Gunthorpe, linux-rdma

On Fri, Aug 13, 2021 at 04:53:56PM -0500, Bob Pearson wrote:
> On 8/4/21 4:05 AM, Zhu Yanjun wrote:
> > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
> >>
> >> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> >>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> >>>>
> >>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> >>>>>
> >>>>> Hi,
> >>>>>
> >>>>> Can you please help me to understand the RXE status in the upstream?
> >>>>>
> >>>>> Does we still have crashes/interop issues/e.t.c?
> >>>>
> >>>> I made some developments with the RXE in the upstream, from my usage
> >>>> with latest RXE,
> >>>> I found the following:
> >>>>
> >>>> 1. rdma-core can not work well with latest RDMA git;
> >>>
> >>> The latest RDMA git is
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >>
> >> "Latest" is a relative term, what SHA did you test?
> >> Let's focus on fixing RXE before we will continue with new features.
> > 
> > Thanks a lot. I agree with you.
> > 
> > rdma-core:
> > 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> > request #1038 from selvintxavier/master
> > 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> > 327d45e0 tests: Add missing MAC element to args list
> > 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> > 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> > be4d8abf bnxt_re/lib: add a function to initialize software queue
> > 
> > kernel rdma:
> > 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> > RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> > 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> > 991c4274dc17 RDMA/hfi1: Fix typo in comments
> > 8d7e415d5561 docs: Fix infiniband uverbs minor number
> > bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> > that requests are valid
> > bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> > e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> > a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> > hfi1_devdata->user_refcount
> > 
> > with the above kernel and rdma-core, the following messages will appear.
> > "
> > [   54.214608] rdma_rxe: loaded
> > [   54.217089] infiniband rxe0: set active
> > [   54.217101] infiniband rxe0: added enp0s8
> > [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> > [  169.074796] rdma_rxe: qp#27 moved to error state
> > [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> > [  169.138889] rdma_rxe: qp#30 moved to error state
> > [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> > [  169.160601] rdma_rxe: qp#31 moved to error state
> > [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> > [  169.182170] rdma_rxe: qp#32 moved to error state
> > [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> > [  169.667850] rdma_rxe: qp#39 moved to error state
> > [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> > [  200.332086] rdma_rxe: qp#58 moved to error state
> > [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> > [  200.396514] rdma_rxe: qp#61 moved to error state
> > [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> > [  200.417956] rdma_rxe: qp#62 moved to error state
> > [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> > [  200.439654] rdma_rxe: qp#63 moved to error state
> > [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> > [  200.933153] rdma_rxe: qp#70 moved to error state
> > [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> > [  208.360028] rdma_rxe: qp#89 moved to error state
> > [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> > [  208.425675] rdma_rxe: qp#92 moved to error state
> > [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> > [  208.447370] rdma_rxe: qp#93 moved to error state
> > [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> > [  208.469550] rdma_rxe: qp#94 moved to error state
> > [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> > [  208.956731] rdma_rxe: qp#100 moved to error state
> > [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> > [  218.363808] rdma_rxe: qp#119 moved to error state
> > [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> > [  218.429513] rdma_rxe: qp#122 moved to error state
> > [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> > [  218.451481] rdma_rxe: qp#123 moved to error state
> > [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> > [  218.473908] rdma_rxe: qp#124 moved to error state
> > [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> > [  218.963641] rdma_rxe: qp#130 moved to error state
> > [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> > [  235.305319] rdma_rxe: qp#149 moved to error state
> > [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> > [  235.368838] rdma_rxe: qp#152 moved to error state
> > [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> > [  235.390192] rdma_rxe: qp#153 moved to error state
> > [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> > [  235.411374] rdma_rxe: qp#154 moved to error state
> > [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> > [  235.895828] rdma_rxe: qp#161 moved to error state
> > "
> > Not sure if they are problems.
> > IMO, we should make further investigations.
> > 
> > Thanks
> > Zhu Yanjun
> >>
> >> Thanks
> 
> 
> 
> All of the messages are from the rxe driver caused by the python tests intentionally causing
> 
> errors. Here is a test run with messages. No errors occurred. This is run on current rdma_core and
> for_next. Does not answer the question about rping. That needs more testing.
> (so ru is short for "./build/bin/run_tests.py --dev rxe_1")
> 
> Bob
> 
> rpearson:rdma-core$ sudo dmesg -C
> 
> rpearson:rdma-core$ so ru
> 
> .............sssssssss.............sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........sssssssssssssssssss....ssss........s...s.s..s..........ssssssssss..ss
> 
> ----------------------------------------------------------------------
> 
> Ran 199 tests in 0.418s
> 
> 
> 
> OK (skipped=134)
> 
> rpearson:rdma-core$ sudo dmesg
> 
> [ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)
> 
> [ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)
> 
> [ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)
> 
> [ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256
> 
> [ 9396.273120] rdma_rxe: qp#27 moved to error state
> 
> [ 9396.283112] rdma_rxe: check_rkey: no MW matches rkey 0x10005be
> 
> [ 9396.283116] rdma_rxe: qp#30 moved to error state
> 
> [ 9396.286497] rdma_rxe: check_rkey: no MW matches rkey 0x100063d
> 
> [ 9396.286501] rdma_rxe: qp#31 moved to error state
> 
> [ 9396.289917] rdma_rxe: check_rkey: no MW matches rkey 0x10007a6
> 
> [ 9396.289922] rdma_rxe: qp#32 moved to error state
> 
> [ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868
> 
> [ 9396.364854] rdma_rxe: qp#37 moved to error state

You shouldn't print these errors by default; they need to be at *_dbg() level.
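
With the prints at *_dbg() level they can still be enabled at run time through
dynamic debug when someone actually needs them, e.g. (assuming the kernel is
built with CONFIG_DYNAMIC_DEBUG):

  # turn on rdma_rxe pr_debug() output at run time
  echo 'module rdma_rxe +p' | sudo tee /sys/kernel/debug/dynamic_debug/control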

Thanks

> 
> 
> 
> 


* Re: RXE status in the upstream rping using rxe
  2021-08-06  2:37         ` Olga Kornievskaia
@ 2021-08-17  2:28           ` Zhu Yanjun
  2021-08-18  6:43             ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-17  2:28 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Leon Romanovsky, Bob Pearson, Jason Gunthorpe, linux-rdma

On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> >
> > On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
> > >
> > > On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> > > > On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
> > > > >
> > > > > On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Can you please help me to understand the RXE status in the upstream?
> > > > > >
> > > > > > Does we still have crashes/interop issues/e.t.c?
> > > > >
> > > > > I made some developments with the RXE in the upstream, from my usage
> > > > > with latest RXE,
> > > > > I found the following:
> > > > >
> > > > > 1. rdma-core can not work well with latest RDMA git;
> > > >
> > > > The latest RDMA git is
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> > >
> > > "Latest" is a relative term, what SHA did you test?
> > > Let's focus on fixing RXE before we will continue with new features.
> >
> > Thanks a lot. I agree with you.
>
> I believe simple rping still doesn't work linux-to-linux. The last
> working version (of rping in rxe) was 5.13 I think. I have posted a
> number of crashes rping encounters (gotta get that working before I
> can even try NFSoRDMA).

The following are the tests I ran:

1. modprobe rdma_rxe
2. modprobe -v -r rdma_rxe
3. rdma link add rxe
4. rdma link del rxe
5. Latest rdma-core && latest kernel upstream
6. Latest kernel < ------rping---- > 5.10.y stable
7. Latest kernel < ------rping---- > 5.11.y stable
8. Latest kernel < ------rping---- > 5.12.y stable
9. Latest kernel < ------rping---- > 5.13.y stable

It seems that the latest upstream kernel (5.14-rc6) can rping to the other
stable kernels.
Can you run your tests again?
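
The rping step in items 6-9 looks roughly like this (the address is an example):

  # on the peer running a stable kernel
  rping -s -a 192.168.56.10 -C 50 -v
  # on the host running the latest upstream kernel
  rping -c -a 192.168.56.10 -C 50 -v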

Zhu Yanjun
>
> Thank you for working on the code.
>
> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
> but lately have been encountering problems.
>
> > rdma-core:
> > 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> > request #1038 from selvintxavier/master
> > 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> > 327d45e0 tests: Add missing MAC element to args list
> > 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> > 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> > be4d8abf bnxt_re/lib: add a function to initialize software queue
> >
> > kernel rdma:
> > 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> > RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> > 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> > 991c4274dc17 RDMA/hfi1: Fix typo in comments
> > 8d7e415d5561 docs: Fix infiniband uverbs minor number
> > bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> > that requests are valid
> > bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> > e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> > a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> > hfi1_devdata->user_refcount
> >
> > with the above kernel and rdma-core, the following messages will appear.
> > "
> > [   54.214608] rdma_rxe: loaded
> > [   54.217089] infiniband rxe0: set active
> > [   54.217101] infiniband rxe0: added enp0s8
> > [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> > [  169.074796] rdma_rxe: qp#27 moved to error state
> > [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> > [  169.138889] rdma_rxe: qp#30 moved to error state
> > [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> > [  169.160601] rdma_rxe: qp#31 moved to error state
> > [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> > [  169.182170] rdma_rxe: qp#32 moved to error state
> > [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> > [  169.667850] rdma_rxe: qp#39 moved to error state
> > [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> > [  200.332086] rdma_rxe: qp#58 moved to error state
> > [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> > [  200.396514] rdma_rxe: qp#61 moved to error state
> > [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> > [  200.417956] rdma_rxe: qp#62 moved to error state
> > [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> > [  200.439654] rdma_rxe: qp#63 moved to error state
> > [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> > [  200.933153] rdma_rxe: qp#70 moved to error state
> > [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> > [  208.360028] rdma_rxe: qp#89 moved to error state
> > [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> > [  208.425675] rdma_rxe: qp#92 moved to error state
> > [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> > [  208.447370] rdma_rxe: qp#93 moved to error state
> > [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> > [  208.469550] rdma_rxe: qp#94 moved to error state
> > [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> > [  208.956731] rdma_rxe: qp#100 moved to error state
> > [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> > [  218.363808] rdma_rxe: qp#119 moved to error state
> > [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> > [  218.429513] rdma_rxe: qp#122 moved to error state
> > [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> > [  218.451481] rdma_rxe: qp#123 moved to error state
> > [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> > [  218.473908] rdma_rxe: qp#124 moved to error state
> > [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> > [  218.963641] rdma_rxe: qp#130 moved to error state
> > [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> > [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> > [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> > [  235.305319] rdma_rxe: qp#149 moved to error state
> > [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> > [  235.368838] rdma_rxe: qp#152 moved to error state
> > [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> > [  235.390192] rdma_rxe: qp#153 moved to error state
> > [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> > [  235.411374] rdma_rxe: qp#154 moved to error state
> > [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> > [  235.895828] rdma_rxe: qp#161 moved to error state
> > "
> > Not sure if they are problems.
> > IMO, we should make further investigations.
> >
> > Thanks
> > Zhu Yanjun
> > >
> > > Thanks


* Re: RXE status in the upstream rping using rxe
  2021-08-17  2:28           ` Zhu Yanjun
@ 2021-08-18  6:43             ` yangx.jy
  2021-08-18  7:20               ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-18  6:43 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On 2021/8/17 10:28, Zhu Yanjun wrote:
> On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia<aglo@umich.edu>  wrote:
>> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun<zyjzyj2000@gmail.com>  wrote:
>>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky<leon@kernel.org>  wrote:
>>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun<zyjzyj2000@gmail.com>  wrote:
>>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky<leon@kernel.org>  wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>>>
>>>>>>> Does we still have crashes/interop issues/e.t.c?
>>>>>> I made some developments with the RXE in the upstream, from my usage
>>>>>> with latest RXE,
>>>>>> I found the following:
>>>>>>
>>>>>> 1. rdma-core can not work well with latest RDMA git;
>>>>> The latest RDMA git is
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>>> "Latest" is a relative term, what SHA did you test?
>>>> Let's focus on fixing RXE before we will continue with new features.
>>> Thanks a lot. I agree with you.
>> I believe simple rping still doesn't work linux-to-linux. The last
>> working version (of rping in rxe) was 5.13 I think. I have posted a
>> number of crashes rping encounters (gotta get that working before I
>> can even try NFSoRDMA).
> The following are my tests.
>
> 1. modprobe rdma_rxe
> 2. modprobe -v -r rdma_rxe
> 3. rdma link add rxe
> 4. rdma link del rxe
> 5. Latest rdma-core && latest kernel upstream
> 6. Latest kernel < ------rping---- > 5.10.y stable
> 7. Latest kernel < ------rping---- > 5.11.y stable
> 8. Latest kernel < ------rping---- > 5.12.y stable
> 9. Latest kernel < ------rping---- > 5.13.y stable
>
> It seems that the latest kernel upstream (5.14-rc6) can rping other
> stable kernels.
> Can you make tests again?
>
> Zhu Yanjun
Hi,

I still get two similar panics with rping or rdma_client/rdma_server between the latest kernel and 5.13:
Panic1:
--------------------------------------------------------
[  268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
[  268.251049] #PF: supervisor read access in kernel mode
[  268.252491] #PF: error_code(0x0000) - not-present page
[  268.253919] PGD 1000067 P4D 1000067 PUD 0
[  268.255052] Oops: 0000 [#1] SMP PTI
[  268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44>  8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
[  268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
[  268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
[  268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
[  268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
[  268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
[  268.274080] FS:  0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
[  268.275899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
[  268.278825] Call Trace:
[  268.279358]<IRQ>
[  268.279747]  rxe_responder+0x11b1/0x2490 [rdma_rxe]
[  268.280798]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  268.281895]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Panic2:
--------------------------------------------------------
[  212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
[  212.530688] #PF: supervisor read access in kernel mode
[  212.533030] #PF: error_code(0x0000) - not-present page
[  212.535428] PGD 1000067 P4D 1000067 PUD 0
[  212.536970] Oops: 0000 [#1] SMP PTI
[  212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44>  8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
[  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
[  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
[  212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
[  212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
[  212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
[  212.555225] FS:  0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
[  212.556749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
[  212.559177] Call Trace:
[  212.559655]<IRQ>
[  212.560055]  send_data_in+0x55/0x73 [rdma_rxe]
[  212.560903]  rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
[  212.561865]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  212.562699]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Note: it is easy to reproduce the panic on the latest kernel.
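
The reproduction is roughly the following two-host rping run (device, netdev
and address are only examples):

  # on both hosts: attach a soft-RoCE device to a netdev
  sudo rdma link add rxe0 type rxe netdev enp0s8
  # rping server on the 5.13 host, client on the latest kernel
  rping -s -a 192.168.56.20 -v
  rping -c -a 192.168.56.20 -C 10 -v
  # the copy_data oops then shows up in dmesg on the crashing host
  dmesg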

Best Regards,
Xiao Yang



>> Thank you for working on the code.
>>
>> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
>> but lately have been encountering problems.
>>
>>> rdma-core:
>>> 313509f8 (HEAD ->  master, origin/master, origin/HEAD) Merge pull
>>> request #1038 from selvintxavier/master
>>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
>>> 327d45e0 tests: Add missing MAC element to args list
>>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
>>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
>>> be4d8abf bnxt_re/lib: add a function to initialize software queue
>>>
>>> kernel rdma:
>>> 0050a57638ca (HEAD ->  for-next, origin/for-next, origin/HEAD)
>>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
>>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
>>> 991c4274dc17 RDMA/hfi1: Fix typo in comments
>>> 8d7e415d5561 docs: Fix infiniband uverbs minor number
>>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
>>> that requests are valid
>>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
>>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
>>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
>>> hfi1_devdata->user_refcount
>>>
>>> with the above kernel and rdma-core, the following messages will appear.
>>> "
>>> [   54.214608] rdma_rxe: loaded
>>> [   54.217089] infiniband rxe0: set active
>>> [   54.217101] infiniband rxe0: added enp0s8
>>> [  167.623200] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  167.645590] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  167.733297] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
>>> [  169.074796] rdma_rxe: qp#27 moved to error state
>>> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
>>> [  169.138889] rdma_rxe: qp#30 moved to error state
>>> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
>>> [  169.160601] rdma_rxe: qp#31 moved to error state
>>> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
>>> [  169.182170] rdma_rxe: qp#32 moved to error state
>>> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
>>> [  169.667850] rdma_rxe: qp#39 moved to error state
>>> [  198.872649] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  198.894829] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  198.981839] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
>>> [  200.332086] rdma_rxe: qp#58 moved to error state
>>> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
>>> [  200.396514] rdma_rxe: qp#61 moved to error state
>>> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
>>> [  200.417956] rdma_rxe: qp#62 moved to error state
>>> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
>>> [  200.439654] rdma_rxe: qp#63 moved to error state
>>> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
>>> [  200.933153] rdma_rxe: qp#70 moved to error state
>>> [  206.880305] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  206.904030] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  206.991494] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
>>> [  208.360028] rdma_rxe: qp#89 moved to error state
>>> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
>>> [  208.425675] rdma_rxe: qp#92 moved to error state
>>> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
>>> [  208.447370] rdma_rxe: qp#93 moved to error state
>>> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
>>> [  208.469550] rdma_rxe: qp#94 moved to error state
>>> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
>>> [  208.956731] rdma_rxe: qp#100 moved to error state
>>> [  216.879703] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  216.902199] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  216.989264] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
>>> [  218.363808] rdma_rxe: qp#119 moved to error state
>>> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
>>> [  218.429513] rdma_rxe: qp#122 moved to error state
>>> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
>>> [  218.451481] rdma_rxe: qp#123 moved to error state
>>> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
>>> [  218.473908] rdma_rxe: qp#124 moved to error state
>>> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
>>> [  218.963641] rdma_rxe: qp#130 moved to error state
>>> [  233.855140] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  233.877202] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  233.963952] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
>>> [  235.305319] rdma_rxe: qp#149 moved to error state
>>> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
>>> [  235.368838] rdma_rxe: qp#152 moved to error state
>>> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
>>> [  235.390192] rdma_rxe: qp#153 moved to error state
>>> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
>>> [  235.411374] rdma_rxe: qp#154 moved to error state
>>> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
>>> [  235.895828] rdma_rxe: qp#161 moved to error state
>>> "
>>> Not sure if they are problems.
>>> IMO, we should make further investigations.
>>>
>>> Thanks
>>> Zhu Yanjun
>>>> Thanks
>


* Re: RXE status in the upstream rping using rxe
  2021-08-18  6:43             ` yangx.jy
@ 2021-08-18  7:20               ` Zhu Yanjun
  2021-08-18  7:44                 ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-18  7:20 UTC (permalink / raw)
  To: yangx.jy
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On Wed, Aug 18, 2021 at 2:43 PM yangx.jy@fujitsu.com
<yangx.jy@fujitsu.com> wrote:
>
> On 2021/8/17 10:28, Zhu Yanjun wrote:
> > On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia<aglo@umich.edu>  wrote:
> >> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun<zyjzyj2000@gmail.com>  wrote:
> >>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky<leon@kernel.org>  wrote:
> >>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
> >>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun<zyjzyj2000@gmail.com>  wrote:
> >>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky<leon@kernel.org>  wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> Can you please help me to understand the RXE status in the upstream?
> >>>>>>>
> >>>>>>> Does we still have crashes/interop issues/e.t.c?
> >>>>>> I made some developments with the RXE in the upstream, from my usage
> >>>>>> with latest RXE,
> >>>>>> I found the following:
> >>>>>>
> >>>>>> 1. rdma-core can not work well with latest RDMA git;
> >>>>> The latest RDMA git is
> >>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
> >>>> "Latest" is a relative term, what SHA did you test?
> >>>> Let's focus on fixing RXE before we will continue with new features.
> >>> Thanks a lot. I agree with you.
> >> I believe simple rping still doesn't work linux-to-linux. The last
> >> working version (of rping in rxe) was 5.13 I think. I have posted a
> >> number of crashes rping encounters (gotta get that working before I
> >> can even try NFSoRDMA).
> > The following are my tests.
> >
> > 1. Modprobe rdma_rxe
> > 2. Modprobe -v -r rdma_rxe
> > 3. Rdma link add rxe
> > 4. Rdma link del rxe
> > 5. Latest rdma-core&&  latest kernel upstream;
> > 6. Latest kernel<  ------rping---->  5.10.y stable
> > 7. Latest kernel<  ------rping---->  5.11.y stable
> > 8. Latest kernel<  ------rping---->  5.12.y stable
> > 9. Latest kernel<  ------rping---->  5.13.y stable
> >
> > It seems that the latest kernel upstream (5.14-rc6) can rping other
> > stable kernels.
> > Can you make tests again?
> >
> > Zhu Yanjun
> Hi,
>
> I still get two similar panics with rping or rdma_client/rdma_server on the latest kernel vs 5.13:
> Panic1:
> --------------------------------------------------------
> [  268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
> [  268.251049] #PF: supervisor read access in kernel mode
> [  268.252491] #PF: error_code(0x0000) - not-present page
> [  268.253919] PGD 1000067 P4D 1000067 PUD 0
> [  268.255052] Oops: 0000 [#1] SMP PTI
> [  268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
> [  268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [  268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
> [  268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44>  8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
> [  268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
> [  268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
> [  268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
> [  268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
> [  268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
> [  268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
> [  268.274080] FS:  0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
> [  268.275899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
> [  268.278825] Call Trace:
> [  268.279358]<IRQ>
> [  268.279747]  rxe_responder+0x11b1/0x2490 [rdma_rxe]
> [  268.280798]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
> [  268.281895]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
> ...
> ------------------------------------------------------
>
> Panic2:
> --------------------------------------------------------
> [  212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
> [  212.530688] #PF: supervisor read access in kernel mode
> [  212.533030] #PF: error_code(0x0000) - not-present page
> [  212.535428] PGD 1000067 P4D 1000067 PUD 0
> [  212.536970] Oops: 0000 [#1] SMP PTI
> [  212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
> [  212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
> [  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
> [  212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7<44>  8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
> [  212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
> [  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
> [  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
> [  212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
> [  212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
> [  212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
> [  212.555225] FS:  0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
> [  212.556749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
> [  212.559177] Call Trace:
> [  212.559655]<IRQ>
> [  212.560055]  send_data_in+0x55/0x73 [rdma_rxe]
> [  212.560903]  rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
> [  212.561865]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
> [  212.562699]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
> ...
> ------------------------------------------------------
>
> Note: it is easy to reproduce the panic on the latest kernel.

Can you let me know how to reproduce the panic?

1. linux upstream < ----rping---- > linux upstream?
2. just run rping?
3. how do you create rxe? with rdma link or rxe_cfg?
4. do you perform any other operations?
5. anything else?

Thanks.
Zhu Yanjun

>
> Best Regards,
> Xiao Yang
>
>
>
> >> Thank you for working on the code.
> >>
> >> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
> >> but lately have been encountering problems.
> >>
> >>> rdma-core:
> >>> 313509f8 (HEAD ->  master, origin/master, origin/HEAD) Merge pull
> >>> request #1038 from selvintxavier/master
> >>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> >>> 327d45e0 tests: Add missing MAC element to args list
> >>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> >>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> >>> be4d8abf bnxt_re/lib: add a function to initialize software queue
> >>>
> >>> kernel rdma:
> >>> 0050a57638ca (HEAD ->  for-next, origin/for-next, origin/HEAD)
> >>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> >>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> >>> 991c4274dc17 RDMA/hfi1: Fix typo in comments
> >>> 8d7e415d5561 docs: Fix infiniband uverbs minor number
> >>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> >>> that requests are valid
> >>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> >>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> >>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> >>> hfi1_devdata->user_refcount
> >>>
> >>> with the above kernel and rdma-core, the following messages will appear.
> >>> "
> >>> [   54.214608] rdma_rxe: loaded
> >>> [   54.217089] infiniband rxe0: set active
> >>> [   54.217101] infiniband rxe0: added enp0s8
> >>> [  167.623200] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  167.645590] rdma_rxe: cqe(1)<  current # elements in queue (6)
> >>> [  167.733297] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> >>> [  169.074796] rdma_rxe: qp#27 moved to error state
> >>> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> >>> [  169.138889] rdma_rxe: qp#30 moved to error state
> >>> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> >>> [  169.160601] rdma_rxe: qp#31 moved to error state
> >>> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> >>> [  169.182170] rdma_rxe: qp#32 moved to error state
> >>> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> >>> [  169.667850] rdma_rxe: qp#39 moved to error state
> >>> [  198.872649] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  198.894829] rdma_rxe: cqe(1)<  current # elements in queue (6)
> >>> [  198.981839] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> >>> [  200.332086] rdma_rxe: qp#58 moved to error state
> >>> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> >>> [  200.396514] rdma_rxe: qp#61 moved to error state
> >>> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> >>> [  200.417956] rdma_rxe: qp#62 moved to error state
> >>> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> >>> [  200.439654] rdma_rxe: qp#63 moved to error state
> >>> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> >>> [  200.933153] rdma_rxe: qp#70 moved to error state
> >>> [  206.880305] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  206.904030] rdma_rxe: cqe(1)<  current # elements in queue (6)
> >>> [  206.991494] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> >>> [  208.360028] rdma_rxe: qp#89 moved to error state
> >>> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> >>> [  208.425675] rdma_rxe: qp#92 moved to error state
> >>> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> >>> [  208.447370] rdma_rxe: qp#93 moved to error state
> >>> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> >>> [  208.469550] rdma_rxe: qp#94 moved to error state
> >>> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> >>> [  208.956731] rdma_rxe: qp#100 moved to error state
> >>> [  216.879703] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  216.902199] rdma_rxe: cqe(1)<  current # elements in queue (6)
> >>> [  216.989264] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> >>> [  218.363808] rdma_rxe: qp#119 moved to error state
> >>> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> >>> [  218.429513] rdma_rxe: qp#122 moved to error state
> >>> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> >>> [  218.451481] rdma_rxe: qp#123 moved to error state
> >>> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> >>> [  218.473908] rdma_rxe: qp#124 moved to error state
> >>> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> >>> [  218.963641] rdma_rxe: qp#130 moved to error state
> >>> [  233.855140] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  233.877202] rdma_rxe: cqe(1)<  current # elements in queue (6)
> >>> [  233.963952] rdma_rxe: cqe(32768)>  max_cqe(32767)
> >>> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> >>> [  235.305319] rdma_rxe: qp#149 moved to error state
> >>> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> >>> [  235.368838] rdma_rxe: qp#152 moved to error state
> >>> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> >>> [  235.390192] rdma_rxe: qp#153 moved to error state
> >>> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> >>> [  235.411374] rdma_rxe: qp#154 moved to error state
> >>> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> >>> [  235.895828] rdma_rxe: qp#161 moved to error state
> >>> "
> >>> Not sure if they are problems.
> >>> IMO, we should make further investigations.
> >>>
> >>> Thanks
> >>> Zhu Yanjun
> >>>> Thanks
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-18  7:20               ` Zhu Yanjun
@ 2021-08-18  7:44                 ` yangx.jy
  2021-08-18  8:28                   ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-18  7:44 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On 2021/8/18 15:20, Zhu Yanjun wrote:
> Can you let me know how to reproduce the panic?
>
> 1. linux upstream<  ----rping---->  linux upstream?
rdma_client on v5.13<  --->  rdma_server on upstream kernel.

> 2. just run rping?
Running rdma_client on v5.13 and rdma_server on upstream can reproduce 
the issue.

Note: running rping can reproduce the issue as well.
> 3. how do you create rxe? with rdma link or rxe_cfg?
rdma link add
> 4. do you make other operations?
No
> 5. other operations?
No
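
Putting the answers above together, the reproduction would look roughly like
this (rxe0/enp0s8 follow the names seen earlier in the thread, the address is
a placeholder, and rdma_server/rdma_client from rdma-core can be used in
place of rping):

# on both nodes, create the rxe device over the Ethernet interface (as root)
$ rdma link add rxe0 type rxe netdev enp0s8

# server side on the upstream kernel, client side on v5.13
server$ rping -s -a nn.nn.nn.nn
client$ rping -c -a nn.nn.nn.nn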
> Thanks.
> Zhu Yanjun
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-18  7:44                 ` yangx.jy
@ 2021-08-18  8:28                   ` Zhu Yanjun
  2021-08-18 14:33                     ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-18  8:28 UTC (permalink / raw)
  To: yangx.jy
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
<yangx.jy@fujitsu.com> wrote:
>
> On 2021/8/18 15:20, Zhu Yanjun wrote:
> > Can you let me know how to reproduce the panic?
> >
> > 1. linux upstream<  ----rping---->  linux upstream?
> rdma_client on v5.13<  --->  rdma_server on upstream kernel.
>
> > 2. just run rping?
> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
> the issue.
>
> Note: running rping can reproduce the issue as well.

rping and rdma_server/rdma_client are from the latest rdma-core?

Thanks
Zhu Yanjun

> > 3. how do you create rxe? with rdma link or rxe_cfg?
> rdma link add
> > 4. do you make other operations?
> No
> > 5. other operations?
> No
> > Thanks.
> > Zhu Yanjun
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-18  8:28                   ` Zhu Yanjun
@ 2021-08-18 14:33                     ` yangx.jy
  2021-08-20  3:31                       ` Zhu Yanjun
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-18 14:33 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On 2021/8/18 16:28, Zhu Yanjun wrote:
> On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
> <yangx.jy@fujitsu.com>  wrote:
>> On 2021/8/18 15:20, Zhu Yanjun wrote:
>>> Can you let me know how to reproduce the panic?
>>>
>>> 1. linux upstream<   ----rping---->   linux upstream?
>> rdma_client on v5.13<   --->   rdma_server on upstream kernel.
>>
>>> 2. just run rping?
>> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
>> the issue.
>>
>> Note: running rping can reproduce the issue as well.
> rping and rdma_server/rdma_client are from the latest rdma-core?
Yes, I use the latest rdma-core from 
https://github.com/linux-rdma/rdma-core (master branch).
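
For reference, that tree can be fetched and built roughly as follows
(./build.sh is rdma-core's in-tree build helper; the exact master commit will
of course move over time):

$ git clone https://github.com/linux-rdma/rdma-core.git
$ cd rdma-core
# record the commit actually under test
$ git log -1 --oneline
$ ./build.sh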
> Thanks
> Zhu Yanjun
>
>>> 3. how do you create rxe? with rdma link or rxe_cfg?
>> rdma link add
>>> 4. do you make other operations?
>> No
>>> 5. other operations?
>> No
>>> Thanks.
>>> Zhu Yanjun
>>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-18 14:33                     ` yangx.jy
@ 2021-08-20  3:31                       ` Zhu Yanjun
  2021-08-20  7:42                         ` yangx.jy
  0 siblings, 1 reply; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-20  3:31 UTC (permalink / raw)
  To: yangx.jy
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On Wed, Aug 18, 2021 at 10:33 PM yangx.jy@fujitsu.com
<yangx.jy@fujitsu.com> wrote:
>
> On 2021/8/18 16:28, Zhu Yanjun wrote:
> > On Wed, Aug 18, 2021 at 3:44 PM yangx.jy@fujitsu.com
> > <yangx.jy@fujitsu.com>  wrote:
> >> On 2021/8/18 15:20, Zhu Yanjun wrote:
> >>> Can you let me know how to reproduce the panic?
> >>>
> >>> 1. linux upstream<   ----rping---->   linux upstream?
> >> rdma_client on v5.13<   --->   rdma_server on upstream kernel.
> >>
> >>> 2. just run rping?
> >> Running rdma_client on v5.13 and rdma_server on upstream can reproduce
> >> the issue.
> >>
> >> Note: running rping can reproduce the issue as well.
> > rping and rdma_server/rdma_client are from the latest rdma-core?
> Yes, I use the latest rdma-core from
> https://github.com/linux-rdma/rdma-core (master branch).

Latest kernel + latest rdma-core < ------rping---- > 5.10.y stable +
latest rdma-core
Latest kernel + latest rdma-core < ------rping---- > 5.11.y stable +
latest rdma-core
Latest kernel + latest rdma-core < ------rping---- > 5.12.y stable +
latest rdma-core
Latest kernel + latest rdma-core < ------rping---- > 5.13.y stable +
latest rdma-core

The above works well.

Zhu Yanjun

> > Thanks
> > Zhu Yanjun
> >
> >>> 3. how do you create rxe? with rdma link or rxe_cfg?
> >> rdma link add
> >>> 4. do you make other operations?
> >> No
> >>> 5. other operations?
> >> No
> >>> Thanks.
> >>> Zhu Yanjun
> >>>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-20  3:31                       ` Zhu Yanjun
@ 2021-08-20  7:42                         ` yangx.jy
  2021-08-20 21:40                           ` Bob Pearson
  0 siblings, 1 reply; 19+ messages in thread
From: yangx.jy @ 2021-08-20  7:42 UTC (permalink / raw)
  To: Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Bob Pearson, Jason Gunthorpe,
	linux-rdma

On 2021/8/20 11:31, Zhu Yanjun wrote:
> Latest kernel + latest rdma-core < ------rping---- > 5.10.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core < ------rping---- > 5.11.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core < ------rping---- > 5.12.y stable +
> latest rdma-core
> Latest kernel + latest rdma-core < ------rping---- > 5.13.y stable +
> latest rdma-core
>
> The above works well.
Hi Yanjun,

Sorry, I don't know why you cannot reproduce the bug.

Did you see the similar bug reported by Olga Kornievskaia?
https://www.spinics.net/lists/linux-rdma/msg104358.html
https://www.spinics.net/lists/linux-rdma/msg104359.html
https://www.spinics.net/lists/linux-rdma/msg104360.html

Best Regards,
Xiao Yang
> Zhu Yanjun
>

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-20  7:42                         ` yangx.jy
@ 2021-08-20 21:40                           ` Bob Pearson
  2021-08-20 22:09                             ` Bob Pearson
  0 siblings, 1 reply; 19+ messages in thread
From: Bob Pearson @ 2021-08-20 21:40 UTC (permalink / raw)
  To: yangx.jy, Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Jason Gunthorpe, linux-rdma

On 8/20/21 2:42 AM, yangx.jy@fujitsu.com wrote:
> On 2021/8/20 11:31, Zhu Yanjun wrote:
>> Latest kernel + latest rdma-core < ------rping---- > 5.10.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core < ------rping---- > 5.11.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core < ------rping---- > 5.12.y stable +
>> latest rdma-core
>> Latest kernel + latest rdma-core < ------rping---- > 5.13.y stable +
>> latest rdma-core
>>
>> The above works well.
> Hi Yanjun,
> 
> Sorry, I don't know why you cannot reproduce the bug.
> 
> Did you see the similar bug reported by Olga Kornievskaia?
> https://www.spinics.net/lists/linux-rdma/msg104358.html
> https://www.spinics.net/lists/linux-rdma/msg104359.html
> https://www.spinics.net/lists/linux-rdma/msg104360.html
> 
> Best Regards,
> Xiao Yang
>> Zhu Yanjun
>>

There is some interest in the current status of rping on rxe.
I have looked at several configurations and tested the following test cases:

	1. The python test suite in rdma-core
	2. ib_xxx_bw and ib_xxx_bw -R for RC
	3. rping

Between the following node configurations:

	A. 5.11.0 (ubuntu 21.04 OOB) + rdma-core 33.1 (ubuntu 21.04 OOB)
	B. 5.11.0 + current rdma-core
		+ "Provider/rxe:Set the correct value of resid for inline data" (a.k.a rdma-core+)
	C. 5.14.0-rc1+ (for-next current)
		+ 5 recent bug fixes (a.k.a. for-next+)
			RDMA/rxe:Fix bug in get srq wqe in rxe_resp.c.patch

			RDMA/rxe:Fix bug in rxe_net.c.patch

			RDMA/rxe:Add memory barriers to kernel queues.patch

			RDMA/rxe:Fix memory allocation while locked.patch

			RDMA/rxe:Zero out index member of struct rxe_queue.patch
		+ rdma-core+
	D. for-next+ + rdma-core (33.1)

Results:
	1.  A N/A
	1.  B no errors, some skips
	1.  C no errors, some skips
	1.  D N/A
	(n.b. requires adding an IPv6 address == gid[0] by hand; see the sketch below)

	2. [A-D] -> [A-D] all pass

	3.  A -> A, C -> C, D -> D all pass, all other combinations fail

	(RDMA_resolve_route: No such device. Not yet sure of the cause of the failures, but looking into it.)
	In theory these should all work but rdmacm is more sensitive to configuration than verbs. 
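
	The gid[0] step referred to in the results above would look roughly like
	this (rxe0/enp0s8 are example names; the gids file is the standard verbs
	sysfs layout, and the prefix length is just illustrative):

	# read gid 0 of the rxe device
	$ cat /sys/class/infiniband/rxe0/ports/1/gids/0
	# add that value, as a link-local IPv6 address, to the backing netdev
	$ ip -6 addr add <gid0-as-an-ipv6-address>/64 dev enp0s8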

Bob


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-20 21:40                           ` Bob Pearson
@ 2021-08-20 22:09                             ` Bob Pearson
  0 siblings, 0 replies; 19+ messages in thread
From: Bob Pearson @ 2021-08-20 22:09 UTC (permalink / raw)
  To: yangx.jy, Zhu Yanjun
  Cc: Olga Kornievskaia, Leon Romanovsky, Jason Gunthorpe, linux-rdma

On 8/20/21 4:40 PM, Bob Pearson wrote:
> On 8/20/21 2:42 AM, yangx.jy@fujitsu.com wrote:
>> On 2021/8/20 11:31, Zhu Yanjun wrote:
>>> Latest kernel + latest rdma-core < ------rping---- > 5.10.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core < ------rping---- > 5.11.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core < ------rping---- > 5.12.y stable +
>>> latest rdma-core
>>> Latest kernel + latest rdma-core < ------rping---- > 5.13.y stable +
>>> latest rdma-core
>>>
>>> The above works well.
>> Hi Yanjun,
>>
>> Sorry, I don't know why you cannot reproduce the bug.
>>
>> Did you see the similar bug reported by Olga Kornievskaia?
>> https://www.spinics.net/lists/linux-rdma/msg104358.html
>> https://www.spinics.net/lists/linux-rdma/msg104359.html
>> https://www.spinics.net/lists/linux-rdma/msg104360.html
>>
>> Best Regards,
>> Xiao Yang
>>> Zhu Yanjun
>>>
> 
> There is some interest in the current status of rping on rxe.
> I have looked at several configurations and tested the following test cases:
> 
> 	1. The python test suite in rdma-core
> 	2. ib_xxx_bw and ib_xxx_bw -R for RC
> 	3. rping
> 
> Between the following node configurations.
> 
> 	A. 5.11.0 (ubuntu 21.04 OOB) + rdma-core 33.1 (ubuntu 21.04 OOB)
> 	B. 5.11.0 + current rdma-core
> 		+ "Provider/rxe:Set the correct value of resid for inline data" (a.k.a rdma-core+)
> 	C. 5.14.0-rc1+ (for-next current)
> 		+ 5 recent bug fixes (a.k.a. for-next+)
> 			RDMA/rxe:Fix bug in get srq wqe in rxe_resp.c.patch
> 
> 			RDMA/rxe:Fix bug in rxe_net.c.patch
> 
> 			RDMA/rxe:Add memory barriers to kernel queues.patch
> 
> 			RDMA/rxe:Fix memory allocation while locked.patch
> 
> 			RDMA/rxe:Zero out index member of struct rxe_queue.patch
> 		+ rdma-core+
> 	D. for-next+ + rdma-core (33.1)
> 
> Results:
> 	1.  A N/A
> 	1.  B no errors, some skips
> 	1.  C no errors, some skips
> 	1.  D N/A
> 	(n.b. requires adding IPV6 address == gid[0] by hand)
> 
> 	2. [A-D] -> [A-D] all pass
> 
> 	3.  A -> A, C -> C, D -> D all pass, all other combinations fail
> 
> 	(RDMA_resolve_route: No such device. Not yet sure cause of failures but looking into it.)
> 	In theory these should all work but rdmacm is more sensitive to configuration than verbs. 
> 
> Bob
> 

Found the problem (thank you, Google). If you run both:

server$ rping -s -a nn.nn.nn.nn
client$ rping -c -a nn.nn.nn.nn

now all tests pass for rping as well.
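
The point, as far as I can tell, is that both ends get bound explicitly to an
address that is actually assigned to the netdev backing the rxe device, so
rdma_cm resolves to the rxe port. A quick way to check which address to pass
to -a (interface name is an example):

# show which netdev backs the rxe device
$ rdma link show
# pick one of the addresses assigned to that netdev
$ ip -br addr show dev enp0s8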

Bob

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: RXE status in the upstream rping using rxe
  2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky
  2021-08-04  1:01 ` Zhu Yanjun
@ 2021-08-23  7:53 ` Zhu Yanjun
  1 sibling, 0 replies; 19+ messages in thread
From: Zhu Yanjun @ 2021-08-23  7:53 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Olga Kornievskaia, Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> Hi,
>
> Can you please help me to understand the RXE status in the upstream?

Hi, all

On Ubuntu 20.04 (kernel 5.4.0-80) with the latest rdma-core:

"
root@xxx:~/rdma-core# cat /etc/issue
Ubuntu 20.04.2 LTS \n \l
root@xxx:~/rdma-core# uname -a
Linux 5.4.0-80-generic #90-Ubuntu SMP Fri Jul 9 22:49:44 UTC 2021
x86_64 x86_64 x86_64 GNU/Linux
root@xxx:~/rdma-core# git log -1 --oneline
206a0cfd (HEAD -> master, origin/master, origin/HEAD) Merge pull
request #1047 from yishaih/mlx5_misc
"

Running run_tests.py, I got the following errors.

Not sure if these are real problems. Please comment.

It is easy to reproduce on Ubuntu 20.04 + 5.4.0-80.
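
For completeness, the sequence behind the failures below is roughly the
following (the run_tests.py location may differ depending on how the tree was
built; the rxe device creation is shown in case one is not already present):

# create an rxe device over the Ethernet interface (as root)
$ rdma link add rxe0 type rxe netdev enp0s8
# build rdma-core and run its python test suite
$ cd ~/rdma-core && ./build.sh
$ ./build/bin/run_tests.py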

"
.............sssssssss..FFF........sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss.ssssssssssssssssssssssssss....s...ss..........s....s..s.......ssReceived
the following exceptions: {'active': BrokenBarrierError()}
EReceived the following exceptions: {'active': BrokenBarrierError()}
E........ss
======================================================================
ERROR: test_rdmacm_async_ex_multicast_traffic (tests.test_rdmacm.CMTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/rdma-core/tests/utils.py", line 976, in inner
    return func(instance)
  File "/root/rdma-core/tests/test_rdmacm.py", line 42, in
test_rdmacm_async_ex_multicast_traffic
    self.two_nodes_rdmacm_traffic(CMAsyncConnection,
  File "/root/rdma-core/tests/base.py", line 368, in two_nodes_rdmacm_traffic
    raise(res)
threading.BrokenBarrierError

======================================================================
ERROR: test_rdmacm_async_multicast_traffic (tests.test_rdmacm.CMTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/rdma-core/tests/utils.py", line 976, in inner
    return func(instance)
  File "/root/rdma-core/tests/test_rdmacm.py", line 36, in
test_rdmacm_async_multicast_traffic
    self.two_nodes_rdmacm_traffic(CMAsyncConnection,
  File "/root/rdma-core/tests/base.py", line 368, in two_nodes_rdmacm_traffic
    raise(res)
threading.BrokenBarrierError

======================================================================
FAIL: test_phys_port_cnt_ex (tests.test_device.DeviceTest)
Test phys_port_cnt_ex
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/rdma-core/tests/test_device.py", line 222, in
test_phys_port_cnt_ex
    self.assertEqual(phys_port_cnt, phys_port_cnt_ex,
AssertionError: 1 != 0 : phys_port_cnt_ex and phys_port_cnt should be
equal if number of ports is less than 256

======================================================================
FAIL: test_query_device (tests.test_device.DeviceTest)
Test ibv_query_device()
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/rdma-core/tests/test_device.py", line 63, in test_query_device
    self.verify_device_attr(attr, dev)
  File "/root/rdma-core/tests/test_device.py", line 187, in verify_device_attr
    assert attr.vendor_id != 0
AssertionError

======================================================================
FAIL: test_query_device_ex (tests.test_device.DeviceTest)
Test ibv_query_device_ex()
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/root/rdma-core/tests/test_device.py", line 206, in test_query_device_ex
    self.verify_device_attr(attr_ex.orig_attr, dev)
  File "/root/rdma-core/tests/test_device.py", line 187, in verify_device_attr
    assert attr.vendor_id != 0
AssertionError

----------------------------------------------------------------------
Ran 205 tests in 40.112s

FAILED (failures=3, errors=2, skipped=137)
Traceback (most recent call last):
  File "device.pyx", line 170, in pyverbs.device.Context.close
AttributeError: 'NoneType' object has no attribute 'debug'
Exception ignored in: 'pyverbs.device.Context.__dealloc__'
Traceback (most recent call last):
  File "device.pyx", line 170, in pyverbs.device.Context.close
AttributeError: 'NoneType' object has no attribute 'debug'
"


>
> Does we still have crashes/interop issues/e.t.c?
>
> Latest commit is:
> 20da44dfe8ef ("RDMA/mlx5: Drop in-driver verbs object creations")
>
> Thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread

Thread overview: 19+ messages
2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky
2021-08-04  1:01 ` Zhu Yanjun
2021-08-04  1:09   ` Zhu Yanjun
2021-08-04  5:41     ` Leon Romanovsky
2021-08-04  9:05       ` Zhu Yanjun
2021-08-06  2:37         ` Olga Kornievskaia
2021-08-17  2:28           ` Zhu Yanjun
2021-08-18  6:43             ` yangx.jy
2021-08-18  7:20               ` Zhu Yanjun
2021-08-18  7:44                 ` yangx.jy
2021-08-18  8:28                   ` Zhu Yanjun
2021-08-18 14:33                     ` yangx.jy
2021-08-20  3:31                       ` Zhu Yanjun
2021-08-20  7:42                         ` yangx.jy
2021-08-20 21:40                           ` Bob Pearson
2021-08-20 22:09                             ` Bob Pearson
2021-08-13 21:53         ` Bob Pearson
2021-08-14  5:32           ` Leon Romanovsky
2021-08-23  7:53 ` Zhu Yanjun
