* RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
@ 2021-07-20  3:46 Olga Kornievskaia
  2021-07-20  6:27 ` Bob Pearson
  0 siblings, 1 reply; 7+ messages in thread
From: Olga Kornievskaia @ 2021-07-20  3:46 UTC (permalink / raw)
  To: Bob Pearson, Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma

Hello,

I would like to report that the rxe driver got broken some time
between 5.13 and 5.14-rc1 (so basically the last git pull). It's not
just NFSoRDMA but simple rping doesn't work. I believe I found the
problematic commit: 5bcf5a59c41e19141783c7305d420a5e36c937b2
"RDMA/rxe: Protext kernel index from user space"

Server side logs: "rdma_rxe: bad ICRC from <>".

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-20  3:46 RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE) Olga Kornievskaia
@ 2021-07-20  6:27 ` Bob Pearson
  2021-07-20 21:48   ` Olga Kornievskaia
  0 siblings, 1 reply; 7+ messages in thread
From: Bob Pearson @ 2021-07-20  6:27 UTC (permalink / raw)
  To: Olga Kornievskaia, Zhu Yanjun; +Cc: Jason Gunthorpe, linux-rdma

On 7/19/21 10:46 PM, Olga Kornievskaia wrote:
> Hello,
> 
> I would like to report that the rxe driver got broken some time
> between 5.13 and 5.14-rc1 (so basically the last git pull). It's not
> just NFSoRDMA but simple rping doesn't work. I believe I found the
> problematic commit: 5bcf5a59c41e19141783c7305d420a5e36c937b2
> "RDMA/rxe: Protext kernel index from user space"
> 
> Server side logs: "rdma_rxe: bad ICRC from <>".
> 
Thanks. That is helpful. Will try to find it.

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-20  6:27 ` Bob Pearson
@ 2021-07-20 21:48   ` Olga Kornievskaia
  2021-07-21  5:47     ` Leon Romanovsky
  2021-07-21  6:16     ` Zhu Yanjun
  0 siblings, 2 replies; 7+ messages in thread
From: Olga Kornievskaia @ 2021-07-20 21:48 UTC (permalink / raw)
  To: Bob Pearson; +Cc: Zhu Yanjun, Jason Gunthorpe, linux-rdma

On Tue, Jul 20, 2021 at 2:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> On 7/19/21 10:46 PM, Olga Kornievskaia wrote:
> > Hello,
> >
> > I would like to report that the rxe driver got broken some time
> > between 5.13 and 5.14-rc1 (so basically the last git pull). It's not
> > just NFSoRDMA but simple rping doesn't work. I believe I found the
> > problematic commit: 5bcf5a59c41e19141783c7305d420a5e36c937b2
> > "RDMA/rxe: Protext kernel index from user space"
> >
> > Server side logs: "rdma_rxe: bad ICRC from <>".
> >
> Thanks. That is helpful. Will try to find it.

Thank you, I appreciate you looking into it. Actually, I'm not 100%
confident that's the commit behind the particular problem I was seeing
in 5.14-rc (rping hangs but does not crash; an NFS mount also hangs
without crashing). git bisect kept encountering crashes along the way,
so I can't say for certain what it "found". I think that commit is the
one that causes the kernel oops, and that something else leads to the
bad ICRC.

I have a general question. I see that you've been posting a lot of
work on RDMA/rxe lately. Can this be viewed as somebody (you/your
company) is now actively supporting rxe driver? It looked like
previously Mellanox had abandoned support for it. We ran into several
issues trying to use rxe for NFSoRDMA throughout the years but they
were not being addressed.

There were a number of commits that lead to crashes. commit
ec9bf373f2458f4b5f1ece8b93a07e6204081667 "RDMA/core: Use refcount_t
instead of atomic_t on refcount of ib_uverbs_device" leads to the
following kernel oops. commit 205be5dc9984b67a3b388cbdaa27a2f2644a4bd6
"RDMA/irdma: Fix spelling mistake "Allocal" -> "Allocate"" also leads
to the kernel oops.

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-20 21:48   ` Olga Kornievskaia
@ 2021-07-21  5:47     ` Leon Romanovsky
  2021-07-21 21:15       ` Olga Kornievskaia
  2021-07-21  6:16     ` Zhu Yanjun
  1 sibling, 1 reply; 7+ messages in thread
From: Leon Romanovsky @ 2021-07-21  5:47 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Bob Pearson, Zhu Yanjun, Jason Gunthorpe, linux-rdma

On Tue, Jul 20, 2021 at 05:48:03PM -0400, Olga Kornievskaia wrote:
> On Tue, Jul 20, 2021 at 2:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:

<...>

> There were a number of commits that lead to crashes. commit
> ec9bf373f2458f4b5f1ece8b93a07e6204081667 "RDMA/core: Use refcount_t
> instead of atomic_t on refcount of ib_uverbs_device" leads to the
> following kernel oops. commit 205be5dc9984b67a3b388cbdaa27a2f2644a4bd6
> "RDMA/irdma: Fix spelling mistake "Allocal" -> "Allocate"" also leads
> to the kernel oops.

The commits above aren't relevant to RXE at all.

If the first commit were wrong, all drivers would experience crashes, and the
second commit is in irdma, not in RXE.

And both of them are legit commits.

Thanks

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-20 21:48   ` Olga Kornievskaia
  2021-07-21  5:47     ` Leon Romanovsky
@ 2021-07-21  6:16     ` Zhu Yanjun
  1 sibling, 0 replies; 7+ messages in thread
From: Zhu Yanjun @ 2021-07-21  6:16 UTC (permalink / raw)
  To: Olga Kornievskaia; +Cc: Bob Pearson, Jason Gunthorpe, linux-rdma

On Wed, Jul 21, 2021 at 5:48 AM Olga Kornievskaia <aglo@umich.edu> wrote:
>
> On Tue, Jul 20, 2021 at 2:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
> >
> > On 7/19/21 10:46 PM, Olga Kornievskaia wrote:
> > > Hello,
> > >
> > > I would like to report that the rxe driver got broken some time
> > > between 5.13 and 5.14-rc1 (so basically the last git pull). It's not
> > > just NFSoRDMA but simple rping doesn't work. I believe I found the
> > > problematic commit: 5bcf5a59c41e19141783c7305d420a5e36c937b2
> > > "RDMA/rxe: Protext kernel index from user space"
> > >
> > > Server side logs: "rdma_rxe: bad ICRC from <>".
> > >
> > Thanks. That is helpful. Will try to find it.
>
> Thank you, I appreciate you looking into it. Actually, I'm not 100%
> confident that's the commit behind the particular problem I was seeing
> in 5.14-rc (rping hangs but does not crash; an NFS mount also hangs
> without crashing). git bisect kept encountering crashes along the way,
> so I can't say for certain what it "found". I think that commit is the
> one that causes the kernel oops, and that something else leads to the
> bad ICRC.

Thanks a lot. I will delve into this problem.

Zhu Yanjun

>
> I have a general question. I see that you've been posting a lot of
> work on RDMA/rxe lately. Can this be viewed as somebody (you/your
> company) is now actively supporting rxe driver? It looked like
> previously Mellanox had abandoned support for it. We ran into several
> issues trying to use rxe for NFSoRDMA throughout the years but they
> were not being addressed.
>
> There were a number of commits that lead to crashes. commit
> ec9bf373f2458f4b5f1ece8b93a07e6204081667 "RDMA/core: Use refcount_t
> instead of atomic_t on refcount of ib_uverbs_device" leads to the
> following kernel oops. commit 205be5dc9984b67a3b388cbdaa27a2f2644a4bd6
> "RDMA/irdma: Fix spelling mistake "Allocal" -> "Allocate"" also leads
> to the kernel oops.

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-21  5:47     ` Leon Romanovsky
@ 2021-07-21 21:15       ` Olga Kornievskaia
  2021-07-21 21:48         ` Bob Pearson
  0 siblings, 1 reply; 7+ messages in thread
From: Olga Kornievskaia @ 2021-07-21 21:15 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: Bob Pearson, Zhu Yanjun, Jason Gunthorpe, linux-rdma

On Wed, Jul 21, 2021 at 1:47 AM Leon Romanovsky <leon@kernel.org> wrote:
>
> On Tue, Jul 20, 2021 at 05:48:03PM -0400, Olga Kornievskaia wrote:
> > On Tue, Jul 20, 2021 at 2:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>
> <...>
>
> > There were a number of commits that lead to crashes. commit
> > ec9bf373f2458f4b5f1ece8b93a07e6204081667 "RDMA/core: Use refcount_t
> > instead of atomic_t on refcount of ib_uverbs_device" leads to the
> > following kernel oops. commit 205be5dc9984b67a3b388cbdaa27a2f2644a4bd6
> > "RDMA/irdma: Fix spelling mistake "Allocal" -> "Allocate"" also leads
> > to the kernel oops.
>
> The commits above aren't relevant to RXE at all.
>
> If the first commit were wrong, all drivers would experience crashes, and the
> second commit is in irdma, not in RXE.
>
> And both of them are legit commits.

Yes, I realize they are problems outside of rxe, but I didn't run into
any crashes when I ran rping on 5.14-rc1 (it was hanging, and
that's what I reported here), so I thought perhaps later patches had
addressed the crashes I'd seen.

Would you like me to post a separate email report on the crash(es) (I
don't recall if it's the same one or two different ones)?

>
> Thanks

* Re: RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE)
  2021-07-21 21:15       ` Olga Kornievskaia
@ 2021-07-21 21:48         ` Bob Pearson
  0 siblings, 0 replies; 7+ messages in thread
From: Bob Pearson @ 2021-07-21 21:48 UTC (permalink / raw)
  To: Olga Kornievskaia, Leon Romanovsky
  Cc: Zhu Yanjun, Jason Gunthorpe, linux-rdma

On 7/21/21 4:15 PM, Olga Kornievskaia wrote:
> On Wed, Jul 21, 2021 at 1:47 AM Leon Romanovsky <leon@kernel.org> wrote:
>>
>> On Tue, Jul 20, 2021 at 05:48:03PM -0400, Olga Kornievskaia wrote:
>>> On Tue, Jul 20, 2021 at 2:27 AM Bob Pearson <rpearsonhpe@gmail.com> wrote:
>>
>> <...>
>>
>>> There were a number of commits that lead to crashes. commit
>>> ec9bf373f2458f4b5f1ece8b93a07e6204081667 "RDMA/core: Use refcount_t
>>> instead of atomic_t on refcount of ib_uverbs_device" leads to the
>>> following kernel oops. commit 205be5dc9984b67a3b388cbdaa27a2f2644a4bd6
>>> "RDMA/irdma: Fix spelling mistake "Allocal" -> "Allocate"" also leads
>>> to the kernel oops.
>>
>> The commits above aren't relevant to RXE at all.
>>
>> If the first commit were wrong, all drivers would experience crashes, and the
>> second commit is in irdma, not in RXE.
>>
>> And both of them are legit commits.
> 
> Yes, I realize they are problems outside of rxe, but I didn't run into
> any crashes when I ran rping on 5.14-rc1 (it was hanging, and
> that's what I reported here), so I thought perhaps later patches had
> addressed the crashes I'd seen.
> 
> Would you like me to post a separate email report on the crash(es) (I
> don't recall if it's the same one or two different ones)?
> 
>>
>> Thanks

If you are able, please try the patch I just sent. It should make things a
little better. If not, I have another idea to try as well.
Bob

Thread overview: 7+ messages
2021-07-20  3:46 RDMA/rxe is broken (impacting running NFSoRDMA over softRoCE) Olga Kornievskaia
2021-07-20  6:27 ` Bob Pearson
2021-07-20 21:48   ` Olga Kornievskaia
2021-07-21  5:47     ` Leon Romanovsky
2021-07-21 21:15       ` Olga Kornievskaia
2021-07-21 21:48         ` Bob Pearson
2021-07-21  6:16     ` Zhu Yanjun
