linux-rdma.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Bob Pearson <rpearsonhpe@gmail.com>
To: Zhu Yanjun <zyjzyj2000@gmail.com>, Leon Romanovsky <leon@kernel.org>
Cc: Olga Kornievskaia <aglo@umich.edu>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>
Subject: Re: RXE status in the upstream rping using rxe
Date: Fri, 13 Aug 2021 16:53:56 -0500	[thread overview]
Message-ID: <3243d639-fef0-0f76-6262-ff7c4a1daca2@gmail.com> (raw)
In-Reply-To: <CAD=hENcnUd-rTHGPq2DjyF7tDHVzCebDO2gtwZa9pw0M_QvaPA@mail.gmail.com>

On 8/4/21 4:05 AM, Zhu Yanjun wrote:
> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>>
>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>
>>>>> Does we still have crashes/interop issues/e.t.c?
>>>>
>>>> I made some developments with the RXE in the upstream, from my usage
>>>> with latest RXE,
>>>> I found the following:
>>>>
>>>> 1. rdma-core can not work well with latest RDMA git;
>>>
>>> The latest RDMA git is
>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>
>> "Latest" is a relative term, what SHA did you test?
>> Let's focus on fixing RXE before we will continue with new features.
> 
> Thanks a lot. I agree with you.
> 
> rdma-core:
> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
> request #1038 from selvintxavier/master
> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
> 327d45e0 tests: Add missing MAC element to args list
> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
> be4d8abf bnxt_re/lib: add a function to initialize software queue
> 
> kernel rdma:
> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
> 991c4274dc17 RDMA/hfi1: Fix typo in comments
> 8d7e415d5561 docs: Fix infiniband uverbs minor number
> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
> that requests are valid
> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
> hfi1_devdata->user_refcount
> 
> with the above kernel and rdma-core, the following messages will appear.
> "
> [   54.214608] rdma_rxe: loaded
> [   54.217089] infiniband rxe0: set active
> [   54.217101] infiniband rxe0: added enp0s8
> [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
> [  169.074796] rdma_rxe: qp#27 moved to error state
> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
> [  169.138889] rdma_rxe: qp#30 moved to error state
> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
> [  169.160601] rdma_rxe: qp#31 moved to error state
> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
> [  169.182170] rdma_rxe: qp#32 moved to error state
> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
> [  169.667850] rdma_rxe: qp#39 moved to error state
> [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
> [  200.332086] rdma_rxe: qp#58 moved to error state
> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
> [  200.396514] rdma_rxe: qp#61 moved to error state
> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
> [  200.417956] rdma_rxe: qp#62 moved to error state
> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
> [  200.439654] rdma_rxe: qp#63 moved to error state
> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
> [  200.933153] rdma_rxe: qp#70 moved to error state
> [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
> [  208.360028] rdma_rxe: qp#89 moved to error state
> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
> [  208.425675] rdma_rxe: qp#92 moved to error state
> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
> [  208.447370] rdma_rxe: qp#93 moved to error state
> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
> [  208.469550] rdma_rxe: qp#94 moved to error state
> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
> [  208.956731] rdma_rxe: qp#100 moved to error state
> [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
> [  218.363808] rdma_rxe: qp#119 moved to error state
> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
> [  218.429513] rdma_rxe: qp#122 moved to error state
> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
> [  218.451481] rdma_rxe: qp#123 moved to error state
> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
> [  218.473908] rdma_rxe: qp#124 moved to error state
> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
> [  218.963641] rdma_rxe: qp#130 moved to error state
> [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
> [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
> [  235.305319] rdma_rxe: qp#149 moved to error state
> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
> [  235.368838] rdma_rxe: qp#152 moved to error state
> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
> [  235.390192] rdma_rxe: qp#153 moved to error state
> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
> [  235.411374] rdma_rxe: qp#154 moved to error state
> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
> [  235.895828] rdma_rxe: qp#161 moved to error state
> "
> Not sure if they are problems.
> IMO, we should make further investigations.
> 
> Thanks
> Zhu Yanjun
>>
>> Thanks



All of the messages are from the rxe driver caused by the python tests intentionally causing

errors. Here is a test run with messages. No errors occurred. This is run on current rdma_core and
for_next. Does not answer the question about rping. That needs more testing.
(so ru is short for "./build/bin/run_tests.py --dev rxe_1")

Bob

rpearson:rdma-core$ sudo dmesg -C

rpearson:rdma-core$ so ru

.............sssssssss.............sssssssssssssss.sssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssssss........sssssssssssssssssss....ssss........s...s.s..s..........ssssssssss..ss

----------------------------------------------------------------------

Ran 199 tests in 0.418s



OK (skipped=134)

rpearson:rdma-core$ sudo dmesg

[ 9396.038090] rdma_rxe: cqe(32768) > max_cqe(32767)

[ 9396.042414] rdma_rxe: cqe(1) < current # elements in queue (6)

[ 9396.056685] rdma_rxe: cqe(32768) > max_cqe(32767)

[ 9396.273114] rdma_rxe: check_rkey: no MW matches rkey 0x1000256

[ 9396.273120] rdma_rxe: qp#27 moved to error state

[ 9396.283112] rdma_rxe: check_rkey: no MW matches rkey 0x10005be

[ 9396.283116] rdma_rxe: qp#30 moved to error state

[ 9396.286497] rdma_rxe: check_rkey: no MW matches rkey 0x100063d

[ 9396.286501] rdma_rxe: qp#31 moved to error state

[ 9396.289917] rdma_rxe: check_rkey: no MW matches rkey 0x10007a6

[ 9396.289922] rdma_rxe: qp#32 moved to error state

[ 9396.364850] rdma_rxe: check_rkey: no MR matches rkey 0x1868

[ 9396.364854] rdma_rxe: qp#37 moved to error state





  parent reply	other threads:[~2021-08-13 21:54 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky
2021-08-04  1:01 ` Zhu Yanjun
2021-08-04  1:09   ` Zhu Yanjun
2021-08-04  5:41     ` Leon Romanovsky
2021-08-04  9:05       ` Zhu Yanjun
2021-08-06  2:37         ` Olga Kornievskaia
2021-08-17  2:28           ` Zhu Yanjun
2021-08-18  6:43             ` yangx.jy
2021-08-18  7:20               ` Zhu Yanjun
2021-08-18  7:44                 ` yangx.jy
2021-08-18  8:28                   ` Zhu Yanjun
2021-08-18 14:33                     ` yangx.jy
2021-08-20  3:31                       ` Zhu Yanjun
2021-08-20  7:42                         ` yangx.jy
2021-08-20 21:40                           ` Bob Pearson
2021-08-20 22:09                             ` Bob Pearson
2021-08-13 21:53         ` Bob Pearson [this message]
2021-08-14  5:32           ` Leon Romanovsky
2021-08-23  7:53 ` Zhu Yanjun

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3243d639-fef0-0f76-6262-ff7c4a1daca2@gmail.com \
    --to=rpearsonhpe@gmail.com \
    --cc=aglo@umich.edu \
    --cc=jgg@nvidia.com \
    --cc=leon@kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=zyjzyj2000@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).