From: "yangx.jy@fujitsu.com" <yangx.jy@fujitsu.com>
To: Zhu Yanjun <zyjzyj2000@gmail.com>
Cc: Olga Kornievskaia <aglo@umich.edu>,
	Leon Romanovsky <leon@kernel.org>,
	Bob Pearson <rpearsonhpe@gmail.com>,
	Jason Gunthorpe <jgg@nvidia.com>,
	linux-rdma <linux-rdma@vger.kernel.org>
Subject: Re: RXE status in the upstream rping using rxe
Date: Wed, 18 Aug 2021 06:43:30 +0000
Message-ID: <611CABE6.3010700@fujitsu.com>
In-Reply-To: <CAD=hENdqho3mRy=gUSE-vuXzLvZPkwJ7kEFrjRN-AxLwvQP18Q@mail.gmail.com>

On 2021/8/17 10:28, Zhu Yanjun wrote:
> On Fri, Aug 6, 2021 at 10:37 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>> On Wed, Aug 4, 2021 at 5:05 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>> On Wed, Aug 4, 2021 at 1:41 PM Leon Romanovsky <leon@kernel.org> wrote:
>>>> On Wed, Aug 04, 2021 at 09:09:41AM +0800, Zhu Yanjun wrote:
>>>>> On Wed, Aug 4, 2021 at 9:01 AM Zhu Yanjun <zyjzyj2000@gmail.com> wrote:
>>>>>> On Wed, Aug 4, 2021 at 2:07 AM Leon Romanovsky <leon@kernel.org> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can you please help me to understand the RXE status in the upstream?
>>>>>>>
>>>>>>> Do we still have crashes, interop issues, etc.?
>>>>>> I have done some development on RXE in the upstream. From my usage
>>>>>> of the latest RXE,
>>>>>> I found the following:
>>>>>>
>>>>>> 1. rdma-core does not work well with the latest RDMA git;
>>>>> The latest RDMA git is
>>>>> https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git
>>>> "Latest" is a relative term, what SHA did you test?
>>>> Let's focus on fixing RXE before we will continue with new features.
>>> Thanks a lot. I agree with you.
>> I believe simple rping still doesn't work linux-to-linux. The last
>> working version (of rping in rxe) was 5.13 I think. I have posted a
>> number of crashes rping encounters (gotta get that working before I
>> can even try NFSoRDMA).
> The following are my tests.
>
> 1. modprobe rdma_rxe
> 2. modprobe -v -r rdma_rxe
> 3. rdma link add rxe
> 4. rdma link del rxe
> 5. Latest rdma-core && latest kernel upstream;
> 6. Latest kernel <------rping----> 5.10.y stable
> 7. Latest kernel <------rping----> 5.11.y stable
> 8. Latest kernel <------rping----> 5.12.y stable
> 9. Latest kernel <------rping----> 5.13.y stable
>
> It seems that the latest kernel upstream (5.14-rc6) can rping other
> stable kernels.
> Can you run your tests again?
>
> Zhu Yanjun
Hi,

I still get two similar panics with rping or rdma_client/server on the latest kernel vs 5.13:
Panic1:
--------------------------------------------------------
[  268.248642] BUG: unable to handle page fault for address: ffff9ae2c07a1414
[  268.251049] #PF: supervisor read access in kernel mode
[  268.252491] #PF: error_code(0x0000) - not-present page
[  268.253919] PGD 1000067 P4D 1000067 PUD 0
[  268.255052] Oops: 0000 [#1] SMP PTI
[  268.256055] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  268.257893] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  268.259995] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  268.261114] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  268.265005] RSP: 0018:ffff9ae2404108b8 EFLAGS: 00010202
[  268.266145] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8f7a8bf9da76
[  268.267703] RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
[  268.269291] RBP: ffff8f7a482f87cc R08: 0000000000000010 R09: 0000000000000000
[  268.270871] R10: 00000000000000cb R11: 0000000000000001 R12: ffff8f7a482f8000
[  268.272468] R13: ffff8f7a8c038928 R14: ffff8f7a482f8008 R15: 0000000000000010
[  268.274080] FS:  0000000000000000(0000) GS:ffff8f7abec00000(0000) knlGS:0000000000000000
[  268.275899] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  268.277205] CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
[  268.278825] Call Trace:
[  268.279358]  <IRQ>
[  268.279747]  rxe_responder+0x11b1/0x2490 [rdma_rxe]
[  268.280798]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  268.281895]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Panic2:
--------------------------------------------------------
[  212.526854] BUG: unable to handle page fault for address: ffffbb97142acc14
[  212.530688] #PF: supervisor read access in kernel mode
[  212.533030] #PF: error_code(0x0000) - not-present page
[  212.535428] PGD 1000067 P4D 1000067 PUD 0
[  212.536970] Oops: 0000 [#1] SMP PTI
[  212.537748] CPU: 2 PID: 0 Comm: swapper/2 Not tainted 5.14.0-rc6+ #1
[  212.538984] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[  212.540853] RIP: 0010:copy_data+0x2d/0x2a0 [rdma_rxe]
[  212.541957] Code: 00 00 41 57 41 56 41 55 41 54 55 53 48 83 ec 20 48 89 7c 24 08 89 74 24 10 44 89 4c 24 14 45 85 c0 0f 84 e8 00 00 00 45 89 c7 <44> 8b 42 04 49 89 d6 44 89 44 24 18 45 39 f8 0f 8c 20 02 00 00 8b
[  212.546041] RSP: 0018:ffffbb9640410898 EFLAGS: 00010202
[  212.547200] RAX: ffffbb97142acc00 RBX: ffff95510ee6d000 RCX: ffff95510a802076
[  212.548782] RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
[  212.550369] RBP: 0000000000000010 R08: 0000000000000010 R09: 0000000000000000
[  212.551992] R10: 0000000000000001 R11: 0000000000000001 R12: ffff95510a802076
[  212.553613] R13: ffff9550f29acd28 R14: ffff95510ee6d008 R15: 0000000000000010
[  212.555225] FS:  0000000000000000(0000) GS:ffff95513ec00000(0000) knlGS:0000000000000000
[  212.556749] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  212.557846] CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
[  212.559177] Call Trace:
[  212.559655]  <IRQ>
[  212.560055]  send_data_in+0x55/0x73 [rdma_rxe]
[  212.560903]  rxe_responder.cold+0xea/0x1f8 [rdma_rxe]
[  212.561865]  rxe_do_task+0x9c/0xe0 [rdma_rxe]
[  212.562699]  rxe_rcv+0x286/0x8e0 [rdma_rxe]
...
------------------------------------------------------

Note: it is easy to reproduce the panic on the latest kernel.
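
In both oopses above, CR2 (the faulting address) is exactly RDX + 4, and the marked opcode bytes `<44> 8b 42 04` decode to `mov 0x4(%rdx),%r8d`, a 4-byte read through RDX. A minimal Python sketch that checks this relationship from the register dump is shown here. The helper names (`parse_regs`, `fault_offset`) are made up for illustration, and the excerpts are the RDX and CR2 lines of the two traces with timestamps stripped. It only does the address arithmetic; it does not determine which rdma_rxe structure RDX was pointing at.

```python
import re

# Register lines copied from the two oopses (timestamps removed).
PANIC1 = """\
RDX: ffff9ae2c07a1410 RSI: 0000000000000001 RDI: ffff8f7a71874400
CR2: ffff9ae2c07a1414 CR3: 000000000263c002 CR4: 0000000000060ee0
"""
PANIC2 = """\
RDX: ffffbb97142acc10 RSI: 0000000000000001 RDI: ffff95510ca00700
CR2: ffffbb97142acc14 CR3: 0000000003b66005 CR4: 0000000000060ee0
"""

# Matches "NAME: <16 hex digits>" pairs such as "RDX: ffff9ae2c07a1410".
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def parse_regs(dump):
    """Return {register name: value} for every register found in the dump."""
    return {name: int(value, 16) for name, value in REG_RE.findall(dump)}


def fault_offset(dump, base="RDX"):
    """Distance from the base register to the faulting address in CR2."""
    regs = parse_regs(dump)
    return regs["CR2"] - regs[base]


if __name__ == "__main__":
    for label, dump in (("Panic1", PANIC1), ("Panic2", PANIC2)):
        regs = parse_regs(dump)
        print(f"{label}: RDX={regs['RDX']:#x} CR2={regs['CR2']:#x} "
              f"offset={fault_offset(dump):#x}")
```

Both traces give an offset of 0x4, which is consistent with the same load in copy_data faulting on a bad pointer in RDX in each case.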

Best Regards,
Xiao Yang



>> Thank you for working on the code.
>>
>> We (NFS community) do test NFSoRDMA every git pull using rxe and siw
>> but lately have been encountering problems.
>>
>>> rdma-core:
>>> 313509f8 (HEAD -> master, origin/master, origin/HEAD) Merge pull
>>> request #1038 from selvintxavier/master
>>> 2d3dc48b Merge pull request #1039 from amzn/pyverbs-mac-fix-pr
>>> 327d45e0 tests: Add missing MAC element to args list
>>> 66aba73d bnxt_re/lib: Move hardware queue to 16B aligned indices
>>> 8754fb51 bnxt_re/lib: Use separate indices for shadow queue
>>> be4d8abf bnxt_re/lib: add a function to initialize software queue
>>>
>>> kernel rdma:
>>> 0050a57638ca (HEAD -> for-next, origin/for-next, origin/HEAD)
>>> RDMA/qedr: Improve error logs for rdma_alloc_tid error return
>>> 090473004b02 RDMA/qed: Use accurate error num in qed_cxt_dynamic_ilt_alloc
>>> 991c4274dc17 RDMA/hfi1: Fix typo in comments
>>> 8d7e415d5561 docs: Fix infiniband uverbs minor number
>>> bbafcbc2b1c9 RDMA/iwpm: Rely on the rdma_nl_[un]register() to ensure
>>> that requests are valid
>>> bdb0e4e3ff19 RDMA/iwpm: Remove not-needed reference counting
>>> e677b72a0647 RDMA/iwcm: Release resources if iw_cm module initialization fails
>>> a0293eb24936 RDMA/hfi1: Convert from atomic_t to refcount_t on
>>> hfi1_devdata->user_refcount
>>>
>>> with the above kernel and rdma-core, the following messages will appear.
>>> "
>>> [   54.214608] rdma_rxe: loaded
>>> [   54.217089] infiniband rxe0: set active
>>> [   54.217101] infiniband rxe0: added enp0s8
>>> [  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
>>> [  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
>>> [  169.074796] rdma_rxe: qp#27 moved to error state
>>> [  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
>>> [  169.138889] rdma_rxe: qp#30 moved to error state
>>> [  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
>>> [  169.160601] rdma_rxe: qp#31 moved to error state
>>> [  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
>>> [  169.182170] rdma_rxe: qp#32 moved to error state
>>> [  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
>>> [  169.667850] rdma_rxe: qp#39 moved to error state
>>> [  198.872649] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  198.894829] rdma_rxe: cqe(1) < current # elements in queue (6)
>>> [  198.981839] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  200.332031] rdma_rxe: check_rkey: no MW matches rkey 0x1000887
>>> [  200.332086] rdma_rxe: qp#58 moved to error state
>>> [  200.396476] rdma_rxe: check_rkey: no MW matches rkey 0x1000b0d
>>> [  200.396514] rdma_rxe: qp#61 moved to error state
>>> [  200.417919] rdma_rxe: check_rkey: no MW matches rkey 0x1000c40
>>> [  200.417956] rdma_rxe: qp#62 moved to error state
>>> [  200.439616] rdma_rxe: check_rkey: no MW matches rkey 0x1000d24
>>> [  200.439654] rdma_rxe: qp#63 moved to error state
>>> [  200.933104] rdma_rxe: check_rkey: no MR matches rkey 0x37d8
>>> [  200.933153] rdma_rxe: qp#70 moved to error state
>>> [  206.880305] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  206.904030] rdma_rxe: cqe(1) < current # elements in queue (6)
>>> [  206.991494] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  208.359987] rdma_rxe: check_rkey: no MW matches rkey 0x1000e4d
>>> [  208.360028] rdma_rxe: qp#89 moved to error state
>>> [  208.425637] rdma_rxe: check_rkey: no MW matches rkey 0x1001136
>>> [  208.425675] rdma_rxe: qp#92 moved to error state
>>> [  208.447333] rdma_rxe: check_rkey: no MW matches rkey 0x10012d8
>>> [  208.447370] rdma_rxe: qp#93 moved to error state
>>> [  208.469511] rdma_rxe: check_rkey: no MW matches rkey 0x100137a
>>> [  208.469550] rdma_rxe: qp#94 moved to error state
>>> [  208.956691] rdma_rxe: check_rkey: no MR matches rkey 0x5670
>>> [  208.956731] rdma_rxe: qp#100 moved to error state
>>> [  216.879703] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  216.902199] rdma_rxe: cqe(1) < current # elements in queue (6)
>>> [  216.989264] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  218.363765] rdma_rxe: check_rkey: no MW matches rkey 0x10014d6
>>> [  218.363808] rdma_rxe: qp#119 moved to error state
>>> [  218.429474] rdma_rxe: check_rkey: no MW matches rkey 0x10017e4
>>> [  218.429513] rdma_rxe: qp#122 moved to error state
>>> [  218.451443] rdma_rxe: check_rkey: no MW matches rkey 0x1001895
>>> [  218.451481] rdma_rxe: qp#123 moved to error state
>>> [  218.473869] rdma_rxe: check_rkey: no MW matches rkey 0x1001910
>>> [  218.473908] rdma_rxe: qp#124 moved to error state
>>> [  218.963602] rdma_rxe: check_rkey: no MR matches rkey 0x757b
>>> [  218.963641] rdma_rxe: qp#130 moved to error state
>>> [  233.855140] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  233.877202] rdma_rxe: cqe(1) < current # elements in queue (6)
>>> [  233.963952] rdma_rxe: cqe(32768) > max_cqe(32767)
>>> [  235.305274] rdma_rxe: check_rkey: no MW matches rkey 0x1001ac2
>>> [  235.305319] rdma_rxe: qp#149 moved to error state
>>> [  235.368800] rdma_rxe: check_rkey: no MW matches rkey 0x1001db8
>>> [  235.368838] rdma_rxe: qp#152 moved to error state
>>> [  235.390155] rdma_rxe: check_rkey: no MW matches rkey 0x1001e4d
>>> [  235.390192] rdma_rxe: qp#153 moved to error state
>>> [  235.411336] rdma_rxe: check_rkey: no MW matches rkey 0x1001f4c
>>> [  235.411374] rdma_rxe: qp#154 moved to error state
>>> [  235.895784] rdma_rxe: check_rkey: no MR matches rkey 0x9482
>>> [  235.895828] rdma_rxe: qp#161 moved to error state
>>> "
>>> I am not sure whether these are problems.
>>> IMO, we should investigate further.
>>>
>>> Thanks
>>> Zhu Yanjun
>>>> Thanks
>
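
The rdma_rxe messages quoted above repeat in bursts of the same shape: CQ size warnings, then a run of failed rkey lookups, each followed by a QP moving to the error state. A small Python sketch for tallying such lines from a saved dmesg capture is given here. The category names and the `tally` helper are illustrative assumptions, and the sample is the first burst from the quoted log with the quote prefix removed and spacing normalized.

```python
import re
from collections import Counter

# First burst of rdma_rxe messages from the quoted log.
SAMPLE = """\
[  167.623200] rdma_rxe: cqe(32768) > max_cqe(32767)
[  167.645590] rdma_rxe: cqe(1) < current # elements in queue (6)
[  167.733297] rdma_rxe: cqe(32768) > max_cqe(32767)
[  169.074755] rdma_rxe: check_rkey: no MW matches rkey 0x1000247
[  169.074796] rdma_rxe: qp#27 moved to error state
[  169.138851] rdma_rxe: check_rkey: no MW matches rkey 0x10005de
[  169.138889] rdma_rxe: qp#30 moved to error state
[  169.160565] rdma_rxe: check_rkey: no MW matches rkey 0x10006f7
[  169.160601] rdma_rxe: qp#31 moved to error state
[  169.182132] rdma_rxe: check_rkey: no MW matches rkey 0x1000782
[  169.182170] rdma_rxe: qp#32 moved to error state
[  169.667803] rdma_rxe: check_rkey: no MR matches rkey 0x18d8
[  169.667850] rdma_rxe: qp#39 moved to error state
"""

# Hypothetical category names; the patterns tolerate variable spacing
# around the comparison operators.
PATTERNS = {
    "cqe_too_large": re.compile(r"cqe\(\d+\)\s*>\s*max_cqe\(\d+\)"),
    "cqe_too_small": re.compile(r"cqe\(\d+\)\s*<\s*current # elements"),
    "no_mw_for_rkey": re.compile(r"no MW matches rkey 0x[0-9a-f]+"),
    "no_mr_for_rkey": re.compile(r"no MR matches rkey 0x[0-9a-f]+"),
    "qp_error_state": re.compile(r"qp#\d+ moved to error state"),
}


def tally(log_text):
    """Count rdma_rxe lines per category; unmatched lines are ignored."""
    counts = Counter()
    for line in log_text.splitlines():
        if "rdma_rxe:" not in line:
            continue
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[name] += 1
                break
    return counts


if __name__ == "__main__":
    for name, count in sorted(tally(SAMPLE).items()):
        print(f"{name}: {count}")
```

In this burst every failed rkey lookup (four MW, one MR) is paired with one QP error transition, so the counts of lookups and error-state lines match.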
