* krping problem on 4.15-rc4
@ 2018-01-09 15:30 Olga Kornievskaia
       [not found] ` <CAN-5tyH1HO7yzzQLyb5z5Pq=OrHnKzmCrR2MffLguqsEA-mwWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-09 15:30 UTC (permalink / raw)
  To: linux-rdma

Hi folks,

I have 2 Linux machines with CX-5 cards (Mellanox MCX515A-CCAT (one
port)) and krping doesn't work in one direction but works in the other.
rping works in both directions. ib_send_bw works in both directions and
displays 39Gb one way and 36Gb the other way on a 40Gb setup.

krping is upstream commit 4df520c888d80e5370d0f58b2eeac8355e3f2286.

Server is started with: [kolga@localhost krping]$ sudo echo
"server,port=9999,addr=172.20.35.191,count=10,verbose" > /proc/krping
And it displays in /var/log/messages:
Jan 4 14:23:29 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
Jan 4 14:23:29 localhost kernel: 00000000 00000000 00000000 00000000
Jan 4 14:23:29 localhost kernel: 00000000 93003204 10000122 0005bfd2
Jan 4 14:23:29 localhost kernel: krping: cq completion failed with
wr_id 0 status 4 opcode 128 vender_err 32
Jan 4 14:23:29 localhost kernel: krping: cq completion in ERROR state
Jan 4 14:23:29 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
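
For anyone decoding the "cq completion failed" line above: a minimal
sketch, assuming the numeric status is the kernel's enum ib_wc_status
from include/rdma/ib_verbs.h. The table below lists only the first few
values from memory, and the helper name decode_krping_failure is
hypothetical, not part of krping.

```python
import re

# First entries of enum ib_wc_status (include/rdma/ib_verbs.h), listed
# from memory. Treat this table as an assumption and check your tree.
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    1: "IB_WC_LOC_LEN_ERR",
    2: "IB_WC_LOC_QP_OP_ERR",
    3: "IB_WC_LOC_EEC_OP_ERR",
    4: "IB_WC_LOC_PROT_ERR",
    5: "IB_WC_WR_FLUSH_ERR",
}


def decode_krping_failure(line):
    """Pull status/opcode/vendor error out of a krping failure line."""
    m = re.search(r"status (\d+) opcode (\d+) vender_err ([0-9a-fA-F]+)", line)
    if m is None:
        return None
    status = int(m.group(1))
    return {
        "status": status,
        "status_name": IB_WC_STATUS.get(status, "unknown"),
        "opcode": int(m.group(2)),
        "vendor_err": m.group(3),  # kept exactly as printed in the log
    }


line = ("krping: cq completion failed with "
        "wr_id 0 status 4 opcode 128 vender_err 32")
print(decode_krping_failure(line))
```

Under that assumption, status 4 is a local protection error. On a
failed completion the verbs interface only guarantees wr_id, status,
qp_num and vendor_err, so the opcode value is probably best not relied
on.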

Client is run with: [kolga@sti-rx200-231-d1 ~]$ sudo echo
"client,addr=172.20.35.191,port=9999,verbose,count=10" > /proc/krping
And in /var/log/messages:
Jan 4 14:19:27 localhost kernel: krping: DISCONNECT EVENT...
Jan 4 14:19:27 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
Jan 4 14:19:28 localhost kernel: krping: cq completion in ERROR state

On the network trace I see (over RRoCE):
CM: ConnectRequest
CM: ConnectReply
CM: ReadyToUse
RC Send Only QP
RC Ack
RC RDMA Read Request
RC RDMA Read Response Only
CM: DisconnectRequest
CM: DisconnectReply

I previously submitted this to Mellanox, but they told me to resubmit
it to the linux-rdma list. They also said that engineering looked at
the CQE error and that its meaning was:
PD (protection domain) violation - error in fetch data in rxs in pd
(send opcodes/ read respond / atomic ack).
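
The dumped CQE can also be picked apart by hand. A minimal sketch,
assuming the dump prints the 64-byte CQE as big-endian 32-bit words and
that the field offsets match struct mlx5_err_cqe as I recall it from
include/linux/mlx5/device.h. Both points are assumptions to verify, and
decode_err_cqe is a hypothetical helper name.

```python
import struct


def decode_err_cqe(words):
    """Decode the 16 hex words of an mlx5 error-CQE dump, in printed order.

    Offsets follow struct mlx5_err_cqe as recalled from
    include/linux/mlx5/device.h (an assumption, not checked here):
    byte 54 vendor_err_synd, byte 55 syndrome, bytes 56-59
    s_wqe_opcode_qpn, bytes 60-61 wqe_counter, byte 63 op_own.
    """
    raw = b"".join(bytes.fromhex(w) for w in words)
    if len(raw) != 64:
        raise ValueError("expected a 64-byte CQE")
    (s_wqe_opcode_qpn,) = struct.unpack_from(">I", raw, 56)
    (wqe_counter,) = struct.unpack_from(">H", raw, 60)
    return {
        "vendor_err_synd": raw[54],
        "syndrome": raw[55],
        "sender_wqe_opcode": s_wqe_opcode_qpn >> 24,
        "qpn": s_wqe_opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "cqe_opcode": raw[63] >> 4,
    }


dump = """
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000000 93003204 10000122 0005bfd2
""".split()

for name, value in decode_err_cqe(dump).items():
    print(f"{name:18s} 0x{value:x}")
```

If the assumed layout is right, syndrome 0x4 would be the driver's
local protection error, sender opcode 0x10 an RDMA READ work request,
and CQE opcode 0xd a requester-side error. That would agree with
status 4 in the krping line and with the "read respond" case in the
description above.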
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: krping problem on 4.15-rc4
       [not found] ` <CAN-5tyH1HO7yzzQLyb5z5Pq=OrHnKzmCrR2MffLguqsEA-mwWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-10 20:10   ` Steve Wise
  2018-01-11 18:18     ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2018-01-10 20:10 UTC (permalink / raw)
  To: 'Olga Kornievskaia', 'linux-rdma'

Hey Olga, 

Are the machines the same kernel version / distro sw / and hw - cpu/motherboard/memory/etc?  If not, what is different about them?  Is it the krping server that sees the CQ error?  Do other rdma devices work on these systems?

Thanks,

Steve.


* Re: krping problem on 4.15-rc4
  2018-01-10 20:10   ` Steve Wise
@ 2018-01-11 18:18     ` Olga Kornievskaia
  2018-01-11 19:45       ` Steve Wise
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-11 18:18 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma

On Wed, Jan 10, 2018 at 3:10 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> Hey Olga,
>
> Are the machines the same kernel version / distro sw / and hw - cpu/motherboard/memory/etc?  If not, what is different about them?  Is it the krping server that sees the CQ error?  Do other rdma devices work on these systems?

Hi Steve,

The machines run the same kernel version (4.15-rc4) and distro
(RHEL7.4). The hardware is the same (PRIMERGY RX200 S7), but one
machine has 8G less memory than the other (64G vs 56G). The krping
error was on the server. These machines only have 1 CX-5 and no other
RDMA devices.

>
> Thanks,
>
> Steve.
>

* RE: krping problem on 4.15-rc4
  2018-01-11 18:18     ` Olga Kornievskaia
@ 2018-01-11 19:45       ` Steve Wise
  2018-01-12 22:06         ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2018-01-11 19:45 UTC (permalink / raw)
  To: 'Olga Kornievskaia'; +Cc: 'linux-rdma'

> > Hey Olga,
> >
> > Are the machines the same kernel version / distro sw / and hw -
> cpu/motherboard/memory/etc?  If not, what is different about them?  Is it the
> krping server that sees the CQ error?  Do other rdma devices work on these
> systems?
> 
> Hi Steve,
> 
> Machines software is the same kernel version (4.15-rc4) / distro sw
> (RHEL7.4). Hardware of those machines the same (PRIMERGY RX200 S7) but
> one machine has 8G less memory than the other (64G vs 56G). kpring
> error was on the server. These machines only have 1 CX-5 no other RDMA
> devices.
> 

Ok.  The memory probably doesn't matter.  Maybe run the krping client and server on the same host (to use hw-loopback), and see whether it works on both, one, or neither of the systems when each acts as both client and server.



* Re: krping problem on 4.15-rc4
  2018-01-11 19:45       ` Steve Wise
@ 2018-01-12 22:06         ` Olga Kornievskaia
       [not found]           ` <CAN-5tyGq=hmXY9HZYXpfaytOUV=gb0fri69gj69WKbbYtW3nTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-12 22:06 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma

On Thu, Jan 11, 2018 at 2:45 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>> > Hey Olga,
>> >
>> > Are the machines the same kernel version / distro sw / and hw -
>> cpu/motherboard/memory/etc?  If not, what is different about them?  Is it the
>> krping server that sees the CQ error?  Do other rdma devices work on these
>> systems?
>>
>> Hi Steve,
>>
>> Machines software is the same kernel version (4.15-rc4) / distro sw
>> (RHEL7.4). Hardware of those machines the same (PRIMERGY RX200 S7) but
>> one machine has 8G less memory than the other (64G vs 56G). kpring
>> error was on the server. These machines only have 1 CX-5 no other RDMA
>> devices.
>>
>
> Ok.  The memory probably doesn't matter.  Maybe run krping client and server on the same host (to use hw-loopback), and see if it works on both, one, or neither systems when they are both the client and server.

Loopback on the original "server" machine produces the same failure.
Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
wr_id 0 status 4 opcode 0 vender_err 32
Jan 12 17:05:40 localhost kernel: krping: cq completion in ERROR state
Jan 12 17:05:40 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
Jan 12 17:05:40 localhost kernel: krping: DISCONNECT EVENT...
Jan 12 17:05:40 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
Jan 12 17:05:40 localhost kernel: krping: cq completion in ERROR state


Loopback on the original "client" machine runs successfully.
Jan 12 17:04:26 localhost kernel: krping: server ping data (64B max):
|rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr|
Jan 12 17:04:26 localhost kernel: krping: ping data (64B max):
|rdma-ping-0: ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqr|
Jan 12 17:04:26 localhost kernel: krping: server ping data (64B max):
|rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs|
Jan 12 17:04:26 localhost kernel: krping: ping data (64B max):
|rdma-ping-1: BCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrs|
Jan 12 17:04:26 localhost kernel: krping: server ping data (64B max):
|rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst|
Jan 12 17:04:26 localhost kernel: krping: ping data (64B max):
|rdma-ping-2: CDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrst|
Jan 12 17:04:26 localhost kernel: krping: server ping data (64B max):
|rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu|
Jan 12 17:04:26 localhost kernel: krping: ping data (64B max):
|rdma-ping-3: DEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstu|
Jan 12 17:04:26 localhost kernel: krping: server ping data (64B max):
|rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv|
Jan 12 17:04:26 localhost kernel: krping: ping data (64B max):
|rdma-ping-4: EFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuv|
Jan 12 17:04:27 localhost kernel: krping: server ping data (64B max):
|rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw|
Jan 12 17:04:27 localhost kernel: krping: ping data (64B max):
|rdma-ping-5: FGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvw|
Jan 12 17:04:27 localhost kernel: krping: server ping data (64B max):
|rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx|
Jan 12 17:04:27 localhost kernel: krping: ping data (64B max):
|rdma-ping-6: GHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwx|
Jan 12 17:04:27 localhost kernel: krping: server ping data (64B max):
|rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy|
Jan 12 17:04:27 localhost kernel: krping: ping data (64B max):
|rdma-ping-7: HIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy|
Jan 12 17:04:27 localhost kernel: krping: server ping data (64B max):
|rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz|
Jan 12 17:04:27 localhost kernel: krping: ping data (64B max):
|rdma-ping-8: IJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz|
Jan 12 17:04:28 localhost kernel: krping: server ping data (64B max):
|rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA|
Jan 12 17:04:28 localhost kernel: krping: ping data (64B max):
|rdma-ping-9: JKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyzA|
Jan 12 17:04:28 localhost kernel: krping: DISCONNECT EVENT...
Jan 12 17:04:28 localhost kernel: krping: wait for RDMA_READ_ADV state 10
Jan 12 17:04:28 localhost kernel: krping: cq completion in ERROR state

What does this mean?

* RE: krping problem on 4.15-rc4
       [not found]           ` <CAN-5tyGq=hmXY9HZYXpfaytOUV=gb0fri69gj69WKbbYtW3nTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-13  0:07             ` Steve Wise
  2018-01-16 19:50               ` Olga Kornievskaia
  0 siblings, 1 reply; 16+ messages in thread
From: Steve Wise @ 2018-01-13  0:07 UTC (permalink / raw)
  To: 'Olga Kornievskaia'
  Cc: 'linux-rdma',
	matanb-VPRAkNaXOzVWk0Htik3J/w, 'Leon Romanovsky'

> > Ok.  The memory probably doesn't matter.  Maybe run krping client and
> server on the same host (to use hw-loopback), and see if it works on both,
> one, or neither systems when they are both the client and server.
> 
> Loopback on the original "server" machine produces the same failure.
> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
> cqe
> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
> wr_id 0 status 4 opcode 0 vender_err 32

Can someone from Mellanox comment more on the above CQE error?  What exactly is it telling us?

> 
> What does this means?

Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?  Perhaps that is exposing a dma mapping problem with krping?

Steve.



* Re: krping problem on 4.15-rc4
  2018-01-13  0:07             ` Steve Wise
@ 2018-01-16 19:50               ` Olga Kornievskaia
       [not found]                 ` <CAN-5tyG9ZsaKZs3ayfFfuy7o25DrXR2yWmwUvLdNutJ1SbEg1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-16 19:50 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w, Leon Romanovsky

On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>> > Ok.  The memory probably doesn't matter.  Maybe run krping client and
>> server on the same host (to use hw-loopback), and see if it works on both,
>> one, or neither systems when they are both the client and server.
>>
>> Loopback on the original "server" machine produces the same failure.
>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>> cqe
>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>> wr_id 0 status 4 opcode 0 vender_err 32
>
> Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>
>>
>> What does this means?
>
> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?

IOMMU (Intel's VT-d) is enabled in the BIOS (on both machines).
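
Whether the running kernel actually uses the IOMMU may depend on the
kernel configuration and boot command line as well as the BIOS setting.
A minimal sketch for checking that from userspace, assuming the usual
/proc/cmdline and /sys/kernel/iommu_groups locations; iommu_state is a
hypothetical helper name.

```python
import os


def iommu_state(cmdline_path="/proc/cmdline",
                groups_dir="/sys/kernel/iommu_groups"):
    """Report IOMMU-related boot arguments and the number of IOMMU groups."""
    try:
        with open(cmdline_path) as f:
            args = f.read().split()
    except OSError:
        args = []
    try:
        groups = len(os.listdir(groups_dir))
    except OSError:
        groups = 0
    prefixes = ("intel_iommu=", "amd_iommu=", "iommu=")
    return {
        "iommu_args": [a for a in args if a.startswith(prefixes)],
        "iommu_groups": groups,
    }


print(iommu_state())
```

A non-zero group count suggests the kernel has the IOMMU active. Zero
groups with VT-d enabled in the BIOS would suggest it is not in use by
the kernel.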

>  Perhaps that is exposing a dma mapping problem with krping?
>
> Steve.
>
>

* Re: krping problem on 4.15-rc4
       [not found]                 ` <CAN-5tyG9ZsaKZs3ayfFfuy7o25DrXR2yWmwUvLdNutJ1SbEg1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-16 21:14                   ` Olga Kornievskaia
       [not found]                     ` <CAN-5tyFSYWaTPVdq=99Yr9XwnULyf4tw06roZys=rtR0F3x03g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-16 21:14 UTC (permalink / raw)
  To: Steve Wise; +Cc: linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w, Leon Romanovsky

On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>>> > Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>> server on the same host (to use hw-loopback), and see if it works on both,
>>> one, or neither systems when they are both the client and server.
>>>
>>> Loopback on the original "server" machine produces the same failure.
>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>> cqe
>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>> wr_id 0 status 4 opcode 0 vender_err 32
>>
>> Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>>
>>>
>>> What does this means?
>>
>> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>
> IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>
>>  Perhaps that is exposing a dma mapping problem with krping?

I have replaced the CX-5 card with another one and I no longer see the
krping problem.  I think this suggests it's a card issue...

* Re: krping problem on 4.15-rc4
       [not found]                     ` <CAN-5tyFSYWaTPVdq=99Yr9XwnULyf4tw06roZys=rtR0F3x03g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-17 21:03                       ` Doug Ledford
       [not found]                         ` <1516223013.3403.285.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Doug Ledford @ 2018-01-17 21:03 UTC (permalink / raw)
  To: Olga Kornievskaia, Steve Wise
  Cc: linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w, Leon Romanovsky


On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> > On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+v7I7tHvgBF7@public.gmane.org> wrote:
> > > > > Ok.  The memory probably doesn't matter.  Maybe run krping client and
> > > > 
> > > > server on the same host (to use hw-loopback), and see if it works on both,
> > > > one, or neither systems when they are both the client and server.
> > > > 
> > > > Loopback on the original "server" machine produces the same failure.
> > > > Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
> > > > cqe
> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> > > > Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
> > > > Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
> > > > wr_id 0 status 4 opcode 0 vender_err 32
> > > 
> > > Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
> > > 
> > > > 
> > > > What does this means?
> > > 
> > > Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
> > 
> > IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
> > 
> > >  Perhaps that is exposing a dma mapping problem with krping?
> 
> I have replaces the CX-5 card with another one and I no longer see the
> krping problem.  I think it speaks that it's a card issue...

Check the firmware on the bad card.  Lots of issues disappear if you
have older firmware and update to the latest.
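
A minimal sketch for comparing firmware versions across hosts, assuming
the RDMA devices expose a fw_ver attribute under /sys/class/infiniband;
rdma_fw_versions is a hypothetical helper name.

```python
import glob
import os


def rdma_fw_versions(base="/sys/class/infiniband"):
    """Map each RDMA device name (e.g. mlx5_0) to its reported fw_ver."""
    versions = {}
    for path in sorted(glob.glob(os.path.join(base, "*", "fw_ver"))):
        dev = os.path.basename(os.path.dirname(path))
        try:
            with open(path) as f:
                versions[dev] = f.read().strip()
        except OSError:
            versions[dev] = None  # attribute present but unreadable
    return versions


for dev, ver in rdma_fw_versions().items():
    print(dev, ver)
```

Running this on both machines before and after a card swap or firmware
update gives a record of which version each test was run against.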

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
    Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD


* Re: krping problem on 4.15-rc4
       [not found]                         ` <1516223013.3403.285.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2018-01-17 22:03                           ` Olga Kornievskaia
       [not found]                             ` <CAN-5tyFM_Noj5n-BW+BMa-0VXBWnUVWU2JkiP2f5JBpZoA6YcQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-17 22:03 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Steve Wise, linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w, Leon Romanovsky

On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>> > On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>> > > > > Ok.  The memory probably doesn't matter.  Maybe run krping client and
>> > > >
>> > > > server on the same host (to use hw-loopback), and see if it works on both,
>> > > > one, or neither systems when they are both the client and server.
>> > > >
>> > > > Loopback on the original "server" machine produces the same failure.
>> > > > Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>> > > > cqe
>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>> > > > Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>> > > > Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>> > > > wr_id 0 status 4 opcode 0 vender_err 32
>> > >
>> > > Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>> > >
>> > > >
>> > > > What does this means?
>> > >
>> > > Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>> >
>> > IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>> >
>> > >  Perhaps that is exposing a dma mapping problem with krping?
>>
>> I have replaces the CX-5 card with another one and I no longer see the
>> krping problem.  I think it speaks that it's a card issue...
>
> Check the firmware on the bad card.  Lots of issues disappear if you
> have older firmware and update to the latest.

That's a valid point. A check of firmware versions is needed. At the
time of the problem, I believe both machines had the same firmware
version. After the card replacement, the replacement card reports
newer firmware.

* Re: krping problem on 4.15-rc4
       [not found]                             ` <CAN-5tyFM_Noj5n-BW+BMa-0VXBWnUVWU2JkiP2f5JBpZoA6YcQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-18 16:13                               ` Olga Kornievskaia
       [not found]                                 ` <CAN-5tyGxnd0WnvgxEpNpZ5fG6u2JZs=Wg0fEvt8EaNLHckvx0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-18 16:13 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Steve Wise, linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w, Leon Romanovsky

On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>> > On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>>> > > > > Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>> > > >
>>> > > > server on the same host (to use hw-loopback), and see if it works on both,
>>> > > > one, or neither systems when they are both the client and server.
>>> > > >
>>> > > > Loopback on the original "server" machine produces the same failure.
>>> > > > Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>> > > > cqe
>>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>> > > > Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>> > > > Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>> > > > wr_id 0 status 4 opcode 0 vender_err 32
>>> > >
>>> > > Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>>> > >
>>> > > >
>>> > > > What does this means?
>>> > >
>>> > > Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>>> >
>>> > IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>>> >
>>> > >  Perhaps that is exposing a dma mapping problem with krping?
>>>
>>> I have replaces the CX-5 card with another one and I no longer see the
>>> krping problem.  I think it speaks that it's a card issue...
>>
>> Check the firmware on the bad card.  Lots of issues disappear if you
>> have older firmware and update to the latest.
>
> That's a valid point. A check of firmware versions is needed. At the
> time of the problem, I believe I had two machines that each had same
> firmware versions. After card replacement, the replacement card
> displays newer firmware.

I have upgraded the firmware on both machines to the latest firmware
available for the card, and now krping does not work on either
machine: when either of them is the server, it fails with the same
output in /var/log/messages:

Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
wr_id 0 status 4 opcode 128 vender_err 32
Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10

client side logs:
Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state

* Re: krping problem on 4.15-rc4
       [not found]                                 ` <CAN-5tyGxnd0WnvgxEpNpZ5fG6u2JZs=Wg0fEvt8EaNLHckvx0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-19 11:08                                   ` Leon Romanovsky
       [not found]                                     ` <20180119110852.GB1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Leon Romanovsky @ 2018-01-19 11:08 UTC (permalink / raw)
  To: Olga Kornievskaia
  Cc: Doug Ledford, Steve Wise, linux-rdma, matanb-VPRAkNaXOzVWk0Htik3J/w


On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> > On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> >> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
> >>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> >>> > On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
> >>> > > > > Ok.  The memory probably doesn't matter.  Maybe run krping client and
> >>> > > >
> >>> > > > server on the same host (to use hw-loopback), and see if it works on both,
> >>> > > > one, or neither systems when they are both the client and server.
> >>> > > >
> >>> > > > Loopback on the original "server" machine produces the same failure.
> >>> > > > Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
> >>> > > > cqe
> >>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> >>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> >>> > > > Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
> >>> > > > Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
> >>> > > > Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
> >>> > > > wr_id 0 status 4 opcode 0 vender_err 32
> >>> > >
> >>> > > Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
> >>> > >
> >>> > > >
> >>> > > > What does this means?
> >>> > >
> >>> > > Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
> >>> >
> >>> > IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
> >>> >
> >>> > >  Perhaps that is exposing a dma mapping problem with krping?
> >>>
> >>> I have replaces the CX-5 card with another one and I no longer see the
> >>> krping problem.  I think it speaks that it's a card issue...
> >>
> >> Check the firmware on the bad card.  Lots of issues disappear if you
> >> have older firmware and update to the latest.
> >
> > That's a valid point. A check of firmware versions is needed. At the
> > time of the problem, I believe I had two machines that each had same
> > firmware versions. After card replacement, the replacement card
> > displays newer firmware.
>
> I have upgraded the firmware on both machines involved to the latest
> available firmware for the card and now I'm in the situation where
> krping does not work on either machine --- when either of them is a
> server it fails with the same information in the var log messages:

Doesn't that mean the issue is in the FW?

>
> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
> wr_id 0 status 4 opcode 128 vender_err 32
> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>
> client side logs:
> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: krping problem on 4.15-rc4
       [not found]                                     ` <20180119110852.GB1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
@ 2018-01-19 12:21                                       ` Majd Dibbiny
       [not found]                                         ` <14B966CB-B883-4431-A2A3-9DDE6B88B9AB-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  2018-01-19 15:53                                       ` Steve Wise
  1 sibling, 1 reply; 16+ messages in thread
From: Majd Dibbiny @ 2018-01-19 12:21 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Olga Kornievskaia, Doug Ledford, Steve Wise, linux-rdma, Matan Barak


> On Jan 19, 2018, at 1:09 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
> 
>> On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
>>> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>>>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>>>> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise@opengridcomputing.com> wrote:
>>>>>>>>> Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>>>>>>> 
>>>>>>>> server on the same host (to use hw-loopback), and see if it works on both,
>>>>>>>> one, or neither systems when they are both the client and server.
>>>>>>>> 
>>>>>>>> Loopback on the original "server" machine produces the same failure.
>>>>>>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>>>>>>> cqe
>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>>>>>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>>>>>>> wr_id 0 status 4 opcode 0 vender_err 32
>>>>>>> 
>>>>>>> Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>>>>>>> 
>>>>>>>> 
>>>>>>>> What does this means?
>>>>>>> 
>>>>>>> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>>>>>> 
>>>>>> IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>>>>>> 
>>>>>>> Perhaps that is exposing a dma mapping problem with krping?
>>>>> 
>>>>> I have replaces the CX-5 card with another one and I no longer see the
>>>>> krping problem.  I think it speaks that it's a card issue...
>>>> 
>>>> Check the firmware on the bad card.  Lots of issues disappear if you
>>>> have older firmware and update to the latest.
>>> 
>>> That's a valid point. A check of firmware versions is needed. At the
>>> time of the problem, I believe I had two machines that each had same
>>> firmware versions. After card replacement, the replacement card
>>> displays newer firmware.
>> 
>> I have upgraded the firmware on both machines involved to the latest
>> available firmware for the card and now I'm in the situation where
>> krping does not work on either machine --- when either of them is a
>> server it fails with the same information in the var log messages:
> 
> Doesn't it mean that the issue in FW?
Did you do a cold reboot after the FW upgrade?
> 
>> 
>> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
>> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
>> wr_id 0 status 4 opcode 128 vender_err 32
>> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
>> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>> 
>> client side logs:
>> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
>> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: krping problem on 4.15-rc4
       [not found]                                         ` <14B966CB-B883-4431-A2A3-9DDE6B88B9AB-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2018-01-19 13:57                                           ` Olga Kornievskaia
       [not found]                                             ` <CAN-5tyGiuuvzxru+aeeCahukrbm_aivN+HfLx=X1d8txxL4A9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-19 13:57 UTC (permalink / raw)
  To: Majd Dibbiny
  Cc: Leon Romanovsky, Doug Ledford, Steve Wise, linux-rdma, Matan Barak

On Fri, Jan 19, 2018 at 7:21 AM, Majd Dibbiny <majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> On Jan 19, 2018, at 1:09 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>
>>> On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
>>>> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>>> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>>>>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>>>>> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>>>>>>>>>> Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>>>>>>>>
>>>>>>>>> server on the same host (to use hw-loopback), and see if it works on both,
>>>>>>>>> one, or neither systems when they are both the client and server.
>>>>>>>>>
>>>>>>>>> Loopback on the original "server" machine produces the same failure.
>>>>>>>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>>>>>>>> cqe
>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>>>>>>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>>>>>>>> wr_id 0 status 4 opcode 0 vender_err 32
>>>>>>>>
>>>>>>>> Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>>>>>>>>
>>>>>>>>>
>>>>>>>>> What does this means?
>>>>>>>>
>>>>>>>> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>>>>>>>
>>>>>>> IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>>>>>>>
>>>>>>>> Perhaps that is exposing a dma mapping problem with krping?
>>>>>>
>>>>>> I have replaces the CX-5 card with another one and I no longer see the
>>>>>> krping problem.  I think it speaks that it's a card issue...
>>>>>
>>>>> Check the firmware on the bad card.  Lots of issues disappear if you
>>>>> have older firmware and update to the latest.
>>>>
>>>> That's a valid point. A check of firmware versions is needed. At the
>>>> time of the problem, I believe I had two machines that each had same
>>>> firmware versions. After card replacement, the replacement card
>>>> displays newer firmware.
>>>
>>> I have upgraded the firmware on both machines involved to the latest
>>> available firmware for the card and now I'm in the situation where
>>> krping does not work on either machine --- when either of them is a
>>> server it fails with the same information in the var log messages:
>>
>> Doesn't it mean that the issue in FW?
> Did you do cold reboot after FW upgrade?

No, I have not done so. The firmware update instructions said to either
run mlxfwreset or reboot (which I assumed would be a warm reboot). I
will try a cold reboot.


>>
>>>
>>> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
>>> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
>>> wr_id 0 status 4 opcode 128 vender_err 32
>>> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
>>> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>>>
>>> client side logs:
>>> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
>>> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>>> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state

^ permalink raw reply	[flat|nested] 16+ messages in thread

* RE: krping problem on 4.15-rc4
       [not found]                                     ` <20180119110852.GB1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
  2018-01-19 12:21                                       ` Majd Dibbiny
@ 2018-01-19 15:53                                       ` Steve Wise
  1 sibling, 0 replies; 16+ messages in thread
From: Steve Wise @ 2018-01-19 15:53 UTC (permalink / raw)
  To: 'Leon Romanovsky', 'Olga Kornievskaia'
  Cc: 'Doug Ledford', 'linux-rdma',
	matanb-VPRAkNaXOzVWk0Htik3J/w

> > >>> > > Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
> > >>> >
> > >>> > IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
> > >>> >
> > >>> > >  Perhaps that is exposing a dma mapping problem with krping?
> > >>>
> > >>> I have replaces the CX-5 card with another one and I no longer see the
> > >>> krping problem.  I think it speaks that it's a card issue...
> > >>
> > >> Check the firmware on the bad card.  Lots of issues disappear if you
> > >> have older firmware and update to the latest.
> > >
> > > That's a valid point. A check of firmware versions is needed. At the
> > > time of the problem, I believe I had two machines that each had same
> > > firmware versions. After card replacement, the replacement card
> > > displays newer firmware.
> >
> > I have upgraded the firmware on both machines involved to the latest
> > available firmware for the card and now I'm in the situation where
> > krping does not work on either machine --- when either of them is a
> > server it fails with the same information in the var log messages:
> 
> Doesn't it mean that the issue in FW?
> 

Is it still possible that krping has some DMA mapping bug that wasn't
detected by the older FW and is now being detected by the new FW?



^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: krping problem on 4.15-rc4
       [not found]                                             ` <CAN-5tyGiuuvzxru+aeeCahukrbm_aivN+HfLx=X1d8txxL4A9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2018-01-19 21:07                                               ` Olga Kornievskaia
  0 siblings, 0 replies; 16+ messages in thread
From: Olga Kornievskaia @ 2018-01-19 21:07 UTC (permalink / raw)
  To: Majd Dibbiny
  Cc: Leon Romanovsky, Doug Ledford, Steve Wise, linux-rdma, Matan Barak

On Fri, Jan 19, 2018 at 8:57 AM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
> On Fri, Jan 19, 2018 at 7:21 AM, Majd Dibbiny <majd-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>>
>>> On Jan 19, 2018, at 1:09 PM, Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org> wrote:
>>>
>>>> On Thu, Jan 18, 2018 at 11:13:08AM -0500, Olga Kornievskaia wrote:
>>>>> On Wed, Jan 17, 2018 at 5:03 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>>>> On Wed, Jan 17, 2018 at 4:03 PM, Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
>>>>>>> On Tue, 2018-01-16 at 16:14 -0500, Olga Kornievskaia wrote:
>>>>>>>> On Tue, Jan 16, 2018 at 2:50 PM, Olga Kornievskaia <aglo-63aXycvo3TyHXe+LvDLADg@public.gmane.org> wrote:
>>>>>>>> On Fri, Jan 12, 2018 at 7:07 PM, Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org> wrote:
>>>>>>>>>>> Ok.  The memory probably doesn't matter.  Maybe run krping client and
>>>>>>>>>>
>>>>>>>>>> server on the same host (to use hw-loopback), and see if it works on both,
>>>>>>>>>> one, or neither systems when they are both the client and server.
>>>>>>>>>>
>>>>>>>>>> Loopback on the original "server" machine produces the same failure.
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error
>>>>>>>>>> cqe
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 00000000 00000000 00000000
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: 00000000 93003204 1000017c 0005e1d2
>>>>>>>>>> Jan 12 17:05:40 localhost kernel: krping: cq completion failed with
>>>>>>>>>> wr_id 0 status 4 opcode 0 vender_err 32
>>>>>>>>>
>>>>>>>>> Can someone from Mellanox comment more on the above CQE error?  What exactly is it tell us?
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> What does this means?
>>>>>>>>>
>>>>>>>>> Not sure.  But it does seem to be tied to that specific machine.  Question:  Is an IOMMU enabled on that system?
>>>>>>>>
>>>>>>>> IOMMU (Inter's VT-d) is enabled in BIOS (on both machines).
>>>>>>>>
>>>>>>>>> Perhaps that is exposing a dma mapping problem with krping?
>>>>>>>
>>>>>>> I have replaces the CX-5 card with another one and I no longer see the
>>>>>>> krping problem.  I think it speaks that it's a card issue...
>>>>>>
>>>>>> Check the firmware on the bad card.  Lots of issues disappear if you
>>>>>> have older firmware and update to the latest.
>>>>>
>>>>> That's a valid point. A check of firmware versions is needed. At the
>>>>> time of the problem, I believe I had two machines that each had same
>>>>> firmware versions. After card replacement, the replacement card
>>>>> displays newer firmware.
>>>>
>>>> I have upgraded the firmware on both machines involved to the latest
>>>> available firmware for the card and now I'm in the situation where
>>>> krping does not work on either machine --- when either of them is a
>>>> server it fails with the same information in the var log messages:
>>>
>>> Doesn't it mean that the issue in FW?
>> Did you do cold reboot after FW upgrade?
>
> No I have done so. Firmware update instruction were to either
> mlxfwreset or reboot (which i assumed would be warm). I will try a
> cold reboot.
>

I have cold rebooted the machines and still have the same problem with krping.

>>>> Jan 18 11:05:54 localhost kernel: mlx5_0:dump_cqe:277:(pid 0): dump error cqe
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 00000000 00000000 00000000
>>>> Jan 18 11:05:54 localhost kernel: 00000000 93003204 10000122 0005bfd2
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion failed with
>>>> wr_id 0 status 4 opcode 128 vender_err 32
>>>> Jan 18 11:05:54 localhost kernel: krping: cq completion in ERROR state
>>>> Jan 18 11:05:54 localhost kernel: krping: wait for RDMA_READ_COMPLETE state 10
>>>>
>>>> client side logs:
>>>> Jan 18 11:14:30 localhost kernel: krping: DISCONNECT EVENT...
>>>> Jan 18 11:14:30 localhost kernel: krping: wait for RDMA_WRITE_ADV state 10
>>>> Jan 18 11:14:30 localhost kernel: krping: cq completion in ERROR state

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-01-19 21:07 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-09 15:30 krping problem on 4.15-rc4 Olga Kornievskaia
     [not found] ` <CAN-5tyH1HO7yzzQLyb5z5Pq=OrHnKzmCrR2MffLguqsEA-mwWg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-10 20:10   ` Steve Wise
2018-01-11 18:18     ` Olga Kornievskaia
2018-01-11 19:45       ` Steve Wise
2018-01-12 22:06         ` Olga Kornievskaia
     [not found]           ` <CAN-5tyGq=hmXY9HZYXpfaytOUV=gb0fri69gj69WKbbYtW3nTQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-13  0:07             ` Steve Wise
2018-01-16 19:50               ` Olga Kornievskaia
     [not found]                 ` <CAN-5tyG9ZsaKZs3ayfFfuy7o25DrXR2yWmwUvLdNutJ1SbEg1w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-16 21:14                   ` Olga Kornievskaia
     [not found]                     ` <CAN-5tyFSYWaTPVdq=99Yr9XwnULyf4tw06roZys=rtR0F3x03g-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-17 21:03                       ` Doug Ledford
     [not found]                         ` <1516223013.3403.285.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2018-01-17 22:03                           ` Olga Kornievskaia
     [not found]                             ` <CAN-5tyFM_Noj5n-BW+BMa-0VXBWnUVWU2JkiP2f5JBpZoA6YcQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-18 16:13                               ` Olga Kornievskaia
     [not found]                                 ` <CAN-5tyGxnd0WnvgxEpNpZ5fG6u2JZs=Wg0fEvt8EaNLHckvx0A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-19 11:08                                   ` Leon Romanovsky
     [not found]                                     ` <20180119110852.GB1393-U/DQcQFIOTAAJjI8aNfphQ@public.gmane.org>
2018-01-19 12:21                                       ` Majd Dibbiny
     [not found]                                         ` <14B966CB-B883-4431-A2A3-9DDE6B88B9AB-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2018-01-19 13:57                                           ` Olga Kornievskaia
     [not found]                                             ` <CAN-5tyGiuuvzxru+aeeCahukrbm_aivN+HfLx=X1d8txxL4A9w-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2018-01-19 21:07                                               ` Olga Kornievskaia
2018-01-19 15:53                                       ` Steve Wise
