linux-rdma.vger.kernel.org archive mirror
From: "yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>
To: Zhu Yanjun <zyjzyj2000@gmail.com>
Cc: Olga Kornievskaia <aglo@umich.edu>,
	Leon Romanovsky <leon@kernel.org>,
	Bob Pearson <rpearsonhpe@gmail.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>
Subject: Re: RXE status in the upstream rping using rxe
Date: Wed, 18 Aug 2021 06:43:30 +0000
Message-ID: <611CABE6.3010700@fujitsu.com>
In-Reply-To: <CAD=hENdqho3mRy=gUSE-vuXzLvZPkwJ7kEFrjRN-AxLwvQP18Q@mail.gmail.com>

On 2021/8/17 10:28, Zhu Yanjun wrote:
> On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>>>
>>>>>>> Do we still have crashes, interop issues, etc.?
>>>>>> I made some developments with the RXE in the upstream, from my usage
>>>>>> with latest RXE,
>>>>>> I found the following:
>>>>>>
>>>>>> 1. rdma-core can not work well with latest RDMA git;
>>>>> The latest RDMA git is
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>>> "Latest" is a relative term, what SHA did you test?
>>>> Let's focus on fixing RXE before we will continue with new features.
>>> Thanks a lot. I agree with you.
>> I believe simple rping still doesn't work linux-to-linux. The last
>> working version (of rping in rxe) was 5.13 I think. I have posted a
>> number of crashes rping encounters (gotta get that working before I
>> can even try NFSoRDMA).
> The following are my tests.
>
> 1. modprobe rdma_rxe
> 2. modprobe -v -r rdma_rxe
> 3. rdma link add rxe
> 4. rdma link del rxe
> 5. Latest rdma-core && latest upstream kernel;
> 6. Latest kernel <------rping------> 5.10.y stable
> 7. Latest kernel <------rping------> 5.11.y stable
> 8. Latest kernel <------rping------> 5.12.y stable
> 9. Latest kernel <------rping------> 5.13.y stable
>
> It seems that the latest kernel upstream (5.14-rc6) can rping other
> stable kernels.
> Can you make tests again?
>
> Zhu Yanjun
Hi,

I still get two similar panics when running rping or rdma_client/rdma_server between the latest kernel and 5.13:
Panic1:
--------------------------------------------------------
[  268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
[  268.251049] #PF: supervisor read access in kernel mode
[  268.252491] #PF: error_code(0x0000) - not-present page
[  268.253919] PGD 1000067 P4D 1000067 PUD 0
[  268.255052] Oops: 0000 [#1] SMP PTI
[  268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
[  268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
[  268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
[  268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
[  268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
[  268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
[  268.274080] FS:  0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
[  268.275899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
[  268.278825] Call Trace:
[  268.279358]  <IRQ>
[  268.279747]  rxe_responder+0x11b1/0x2490 [rdma_rxe]
[  268.280798]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  268.281895]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Panic2:
--------------------------------------------------------
[  212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
[  212.530688] #PF: supervisor read access in kernel mode
[  212.533030] #PF: error_code(0x0000) - not-present page
[  212.535428] PGD 1000067 P4D 1000067 PUD 0
[  212.536970] Oops: 0000 [#1] SMP PTI
[  212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
[  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
[  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
[  212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
[  212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
[  212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
[  212.555225] FS:  0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
[  212.556749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
[  212.559177] Call Trace:
[  212.559655]  <IRQ>
[  212.560055]  send_data_in+0x55/0x73 [rdma_rxe]
[  212.560903]  rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
[  212.561865]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  212.562699]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Note: the panic is easy to reproduce on the latest kernel.
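
As a rough cross-check of the two traces, the marked instruction bytes `<44> 8b 42 04` decode to a 4-byte load from RDX+4 (`mov 0x4(%rdx),%r8d`), so the faulting address in CR2 should equal RDX+4 in both panics. The sketch below only does that arithmetic on register lines copied from the reports; the helper names `reg` and `fault_offset` are made up for illustration, and it makes no claim about which rxe structure RDX was pointing into.

```python
import re

# Register lines copied from the two oops reports (timestamps dropped).
PANIC1 = """\
RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
"""

PANIC2 = """\
RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
"""


def reg(text, name):
    """Return the value of register `name` in an oops excerpt as an int."""
    match = re.search(rf"\b{name}: ([0-9a-f]{{16}})\b", text)
    if match is None:
        raise ValueError(f"register {name} not found")
    return int(match.group(1), 16)


def fault_offset(text):
    """Distance between the faulting address (CR2) and the pointer in RDX."""
    return reg(text, "CR2") - reg(text, "RDX")


for label, text in (("Panic1", PANIC1), ("Panic2", PANIC2)):
    print(label, hex(fault_offset(text)))
```

Both traces give an offset of 4, which is consistent with the same load in copy_data() faulting on a bad pointer in each case, even though the callers differ (rxe_responder directly in Panic1, send_data_in in Panic2).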

Best Regards,
Xiao Yang



>> Thank you for working on the code.
>>
>> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
>> but lately have been encountering problems.
>>
>>> rdma-core:
>>> 313509f8 (HEAD ->  master, origin/master, origin/HEAD) Merge pull
>>> request #1038 from selvintxavier/master
>>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
>>> 327d45e0 tests: Add missing MAC element to args list
>>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
>>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
>>> be4d8abf bnxt_re/lib: add a function to initialize software queue
>>>
>>> kernel rdma:
>>> 0050a57638ca (HEAD ->  for-next, origin/for-next, origin/HEAD)
>>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
>>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
>>> 991c4274dc17 RDMA/hfi1: Fix typo in comments
>>> 8d7e415d5561 docs: Fix infiniband uverbs minor number
>>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
>>> that requests are valid
>>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
>>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
>>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
>>> hfi1_devdata->user_refcount
>>>
>>> with the above kernel and rdma-core, the following messages will appear.
>>> "
>>> [   54.214608] rdma_rxe: loaded
>>> [   54.217089] infiniband rxe0: set active
>>> [   54.217101] infiniband rxe0: added enp0s8
>>> [  167.623200] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  167.645590] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  167.733297] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
>>> [  169.074796] rdma_rxe: qp#27 moved to error state
>>> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
>>> [  169.138889] rdma_rxe: qp#30 moved to error state
>>> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
>>> [  169.160601] rdma_rxe: qp#31 moved to error state
>>> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
>>> [  169.182170] rdma_rxe: qp#32 moved to error state
>>> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
>>> [  169.667850] rdma_rxe: qp#39 moved to error state
>>> [  198.872649] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  198.894829] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  198.981839] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
>>> [  200.332086] rdma_rxe: qp#58 moved to error state
>>> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
>>> [  200.396514] rdma_rxe: qp#61 moved to error state
>>> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
>>> [  200.417956] rdma_rxe: qp#62 moved to error state
>>> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
>>> [  200.439654] rdma_rxe: qp#63 moved to error state
>>> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
>>> [  200.933153] rdma_rxe: qp#70 moved to error state
>>> [  206.880305] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  206.904030] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  206.991494] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
>>> [  208.360028] rdma_rxe: qp#89 moved to error state
>>> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
>>> [  208.425675] rdma_rxe: qp#92 moved to error state
>>> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
>>> [  208.447370] rdma_rxe: qp#93 moved to error state
>>> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
>>> [  208.469550] rdma_rxe: qp#94 moved to error state
>>> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
>>> [  208.956731] rdma_rxe: qp#100 moved to error state
>>> [  216.879703] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  216.902199] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  216.989264] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
>>> [  218.363808] rdma_rxe: qp#119 moved to error state
>>> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
>>> [  218.429513] rdma_rxe: qp#122 moved to error state
>>> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
>>> [  218.451481] rdma_rxe: qp#123 moved to error state
>>> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
>>> [  218.473908] rdma_rxe: qp#124 moved to error state
>>> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
>>> [  218.963641] rdma_rxe: qp#130 moved to error state
>>> [  233.855140] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  233.877202] rdma_rxe: cqe(1)<  current # elements in queue (6)
>>> [  233.963952] rdma_rxe: cqe(32768)>  max_cqe(32767)
>>> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
>>> [  235.305319] rdma_rxe: qp#149 moved to error state
>>> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
>>> [  235.368838] rdma_rxe: qp#152 moved to error state
>>> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
>>> [  235.390192] rdma_rxe: qp#153 moved to error state
>>> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
>>> [  235.411374] rdma_rxe: qp#154 moved to error state
>>> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
>>> [  235.895828] rdma_rxe: qp#161 moved to error state
>>> "
>>> Not sure if they are problems.
>>> IMO, we should make further investigations.
>>>
>>> Thanks
>>> Zhu Yanjun
>>>> Thanks
>
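
The quoted rdma_rxe messages repeat in a similar pattern on each test run, so a per-type tally may be easier to compare across kernels than the raw dmesg output. The following is a minimal sketch that counts message types in one run copied from the quoted log; the category labels and the `tally` helper are illustrative names, not anything provided by rdma-core or the kernel.

```python
import re
from collections import Counter

# One test run, copied from the quoted dmesg output.
LOG = """\
[  167.623200] rdma_rxe: cqe(32768)>  max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1)<  current # elements in queue (6)
[  167.733297] rdma_rxe: cqe(32768)>  max_cqe(32767)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
[  169.138889] rdma_rxe: qp#30 moved to error state
[  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
[  169.160601] rdma_rxe: qp#31 moved to error state
[  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
[  169.182170] rdma_rxe: qp#32 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
"""

# (label, regex) pairs; the first matching pattern wins for each line.
PATTERNS = [
    ("cqe above max_cqe", r"cqe\(\d+\)\s*>\s*max_cqe"),
    ("cqe below queue depth", r"cqe\(\d+\)\s*<\s*current"),
    ("no MW matches rkey", r"no MW matches rkey"),
    ("no MR matches rkey", r"no MR matches rkey"),
    ("qp moved to error state", r"qp#\d+ moved to error state"),
]


def tally(log):
    """Count rdma_rxe messages per category in a dmesg excerpt."""
    counts = Counter()
    for line in log.splitlines():
        for label, pattern in PATTERNS:
            if re.search(pattern, line):
                counts[label] += 1
                break
    return counts


for label, count in tally(LOG).items():
    print(f"{count:3d}  {label}")
```

Feeding the full dmesg output of each run through `tally` would show whether the mix of MW and MR rkey failures changes between kernel versions.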

Thread overview: 19+ messages
2021-08-03 18:07 RXE status in the upstream rping using rxe Leon Romanovsky
2021-08-04  1:01 ` Zhu Yanjun
2021-08-04  1:09   ` Zhu Yanjun
2021-08-04  5:41     ` Leon Romanovsky
2021-08-04  9:05       ` Zhu Yanjun
2021-08-06  2:37         ` Olga Kornievskaia
2021-08-17  2:28           ` Zhu Yanjun
2021-08-18  6:43             ` yangx.jy [this message]
2021-08-18  7:20               ` Zhu Yanjun
2021-08-18  7:44                 ` yangx.jy
2021-08-18  8:28                   ` Zhu Yanjun
2021-08-18 14:33                     ` yangx.jy
2021-08-20  3:31                       ` Zhu Yanjun
2021-08-20  7:42                         ` yangx.jy
2021-08-20 21:40                           ` Bob Pearson
2021-08-20 22:09                             ` Bob Pearson
2021-08-13 21:53         ` Bob Pearson
2021-08-14  5:32           ` Leon Romanovsky
2021-08-23  7:53 ` Zhu Yanjun
