* mlx5 core/en oops in 4.6-rc6+
@ 2016-05-05 16:00 Doug Ledford
  2016-05-05 16:42 ` Saeed Mahameed
  0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-05 16:00 UTC
  To: Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 4931 bytes --]

Just had this pop up during testing, happened very soon after bootup:

[   47.235925] BUG: unable to handle kernel NULL pointer dereference at
00000000000001e8
[   47.245057] IP: [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
[   47.252822] PGD 0
[   47.255218] Oops: 0000 [#1] SMP
[   47.259070] Modules linked in: sch_mqprio bridge 8021q garp mrp stp
llc ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp
ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa
ib_mad x86_pkg_temp_thermal coretd
[   47.352984] CPU: 18 PID: 1358 Comm: NetworkManager Not tainted
4.6.0-rc6-00004-g7199787 #102
[   47.362460] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS
1.6.2 01/08/2016
[   47.370869] task: ffff88103369d000 ti: ffff88103751c000 task.ti:
ffff88103751c000
[   47.379263] RIP: 0010:[<ffffffffc0328b9c>]  [<ffffffffc0328b9c>]
mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
[   47.389627] RSP: 0018:ffff88103751f7d0  EFLAGS: 00010282
[   47.395574] RAX: ffff880fe6f51d00 RBX: 0000000000000000 RCX:
0000000000000081
[   47.403571] RDX: ffff880ff1dc3000 RSI: ffff880fe6f51d00 RDI:
0000000000000000
[   47.411561] RBP: ffff88103751f828 R08: 0000000000020c80 R09:
ffffffff81871e04
[   47.419563] R10: ffffea003f9bd400 R11: ffff88100116de00 R12:
000000000000003e
[   47.427566] R13: ffff880fe6f51d00 R14: ffff8810240d0090 R15:
ffff8810240d0068
[   47.435557] FS:  00007fd79b882dc0(0000) GS:ffff88103ee40000(0000)
knlGS:0000000000000000
[   47.444625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   47.451062] CR2: 00000000000001e8 CR3: 0000001cf86c5000 CR4:
00000000001406e0
[   47.459053] Stack:
[   47.461306]  ffffffff81875480 ffff880fe6f50c00 ffff881d02f9b800
ffff88103751f838
[   47.469647]  ffffffff81a08415 ffff88103751f818 ffff880fe6f51d00
000000000000003e
[   47.477964]  ffff881d02f9bd00 ffff8810240d0090 ffff8810240d0068
ffff88103751f838
[   47.486279] Call Trace:
[   47.489019]  [<ffffffff81875480>] ? consume_skb+0x80/0x150
[   47.495178]  [<ffffffff81a08415>] ? packet_rcv+0x65/0x6d0
[   47.501244]  [<ffffffffc03299ae>] mlx5e_xmit+0x2e/0x40 [mlx5_core]
[   47.508169]  [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
[   47.515007]  [<ffffffff818951bb>] ? validate_xmit_skb.isra.80+0x4b/0x4e0
[   47.522516]  [<ffffffff818d036f>] sch_direct_xmit+0x19f/0x360
[   47.528963]  [<ffffffff81896565>] __dev_queue_xmit+0x6e5/0xaa0
[   47.535502]  [<ffffffff81875480>] ? consume_skb+0x80/0x150
[   47.542723]  [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
[   47.549856]  [<ffffffffc08d1d54>]
vlan_dev_hard_start_xmit+0x104/0x210 [8021q]
[   47.558933]  [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
[   47.566738]  [<ffffffff8189675a>] __dev_queue_xmit+0x8da/0xaa0
[   47.574246]  [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
[   47.581349]  [<ffffffff818a2d07>] neigh_connected_output+0x107/0x170
[   47.589433]  [<ffffffff819a3e9f>] ip6_finish_output2+0x23f/0x720
[   47.597128]  [<ffffffff81430f32>] ? selinux_ipv6_postroute+0x22/0x30
[   47.605207]  [<ffffffff819a666b>] ip6_finish_output+0x13b/0x1e0
[   47.612809]  [<ffffffff819a6777>] ip6_output+0x67/0x1c0
[   47.619619]  [<ffffffff819a6530>] ? ip6_fragment+0xd80/0xd80
[   47.626903]  [<ffffffff819fb80d>] ip6_local_out+0x4d/0x60
[   47.633884]  [<ffffffff819a703b>] ip6_send_skb+0x2b/0xb0
[   47.640773]  [<ffffffff819a713d>] ip6_push_pending_frames+0x7d/0x90
[   47.648710]  [<ffffffff819d533d>] rawv6_sendmsg+0xd2d/0x1210
[   47.655938]  [<ffffffff8128f70a>] ? do_wp_page+0x3ba/0x910
[   47.662944]  [<ffffffff8142a970>] ? sock_has_perm+0x80/0xb0
[   47.670020]  [<ffffffff8194f2c7>] inet_sendmsg+0x97/0xf0
[   47.676778]  [<ffffffff818673f8>] sock_sendmsg+0x58/0x90
[   47.683505]  [<ffffffff81868148>] SYSC_sendto+0x138/0x1b0
[   47.690302]  [<ffffffff8109d5a8>] ? __do_page_fault+0x338/0x9d0
[   47.697656]  [<ffffffff8116b131>] ? ktime_get_with_offset+0x71/0x130
[   47.705481]  [<ffffffff81163ee7>] ? posix_get_boottime+0x37/0x60
[   47.712904]  [<ffffffff81868b36>] SyS_sendto+0x16/0x20
[   47.719346]  [<ffffffff81a336b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
[   47.727230] Code: 05 a9 9f 03 00 01 66 31 47 48 5d c3 0f 1f 00 0f 1f
44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 53 48 89 fb 48 83
ec 30 <0f> b7 87 e8 01 00 00 0f b6 8f ea 01 00 00 45 8b 95 80 00 00 00
[   47.750336] RIP  [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80
[mlx5_core]
[   47.758755]  RSP <ffff88103751f7d0>
[   47.763368] CR2: 00000000000001e8
[   47.767779] ---[ end trace 35565b04ca44e521 ]---

It appears to be intermittent as this machine has booted this kernel
multiple times without hitting this.  Network setup includes both vlan
and non-vlan interfaces.  If you need more info from me, please include
me on the Cc: as I don't follow netdev@
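
As an aside for anyone decoding the trace by hand: the faulting bytes in
the Code: line, <0f> b7 87 e8 01 00 00, are a movzx load from
0x1e8(%rdi), and with RDI = 0 that address matches CR2.  On x86-64, RDI
carries the first function argument, so this suggests the first argument
to mlx5e_sq_xmit was NULL.  Below is a minimal sketch of that decoding in
C; decode_movzx_disp32 is a hypothetical helper written for this note,
not part of any kernel tool, and it only handles this one instruction
form.

```c
#include <stdint.h>

/*
 * Hypothetical helper: decode an x86-64 "0f b7 /r" or "0f b6 /r" (movzx)
 * instruction whose ModRM byte uses mod=10b (base register plus 32-bit
 * displacement) and no SIB byte.  On success, returns 0 and stores the
 * base register number and the displacement.
 */
int decode_movzx_disp32(const uint8_t *insn, unsigned *base_reg, uint32_t *disp)
{
	uint8_t modrm;

	if (insn[0] != 0x0f || (insn[1] != 0xb7 && insn[1] != 0xb6))
		return -1;              /* not a movzx opcode */

	modrm = insn[2];
	if ((modrm >> 6) != 2)
		return -1;              /* mod is not 10b (disp32) */
	if ((modrm & 7) == 4)
		return -1;              /* rm=100b means a SIB byte follows */

	*base_reg = modrm & 7;          /* 7 is rdi when no REX.B prefix is set */

	/* The displacement is stored little-endian after the ModRM byte. */
	*disp = (uint32_t)insn[3] |
		((uint32_t)insn[4] << 8) |
		((uint32_t)insn[5] << 16) |
		((uint32_t)insn[6] << 24);
	return 0;
}
```

Feeding it the seven bytes from the trace gives base register 7 (rdi)
and displacement 0x1e8, the same value reported in CR2.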

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-05 16:00 mlx5 core/en oops in 4.6-rc6+ Doug Ledford
@ 2016-05-05 16:42 ` Saeed Mahameed
  2016-05-05 17:16   ` Doug Ledford
  0 siblings, 1 reply; 7+ messages in thread
From: Saeed Mahameed @ 2016-05-05 16:42 UTC
  To: Doug Ledford; +Cc: Linux Netdev List

On Thu, May 5, 2016 at 7:00 PM, Doug Ledford <dledford@redhat.com> wrote:
> Just had this pop up during testing, happened very soon after bootup:
>
> [   47.235925] BUG: unable to handle kernel NULL pointer dereference at
> 00000000000001e8
> [   47.245057] IP: [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
> [   47.252822] PGD 0
> [   47.255218] Oops: 0000 [#1] SMP
> [   47.259070] Modules linked in: sch_mqprio bridge 8021q garp mrp stp
> llc ib_iser libiscsi scsi_transport_iscsi ib_srp scsi_transport_srp
> ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_sa
> ib_mad x86_pkg_temp_thermal coretd
> [   47.352984] CPU: 18 PID: 1358 Comm: NetworkManager Not tainted
> 4.6.0-rc6-00004-g7199787 #102
> [   47.362460] Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS
> 1.6.2 01/08/2016
> [   47.370869] task: ffff88103369d000 ti: ffff88103751c000 task.ti:
> ffff88103751c000
> [   47.379263] RIP: 0010:[<ffffffffc0328b9c>]  [<ffffffffc0328b9c>]
> mlx5e_sq_xmit+0x1c/0xd80 [mlx5_core]
> [   47.389627] RSP: 0018:ffff88103751f7d0  EFLAGS: 00010282
> [   47.395574] RAX: ffff880fe6f51d00 RBX: 0000000000000000 RCX:
> 0000000000000081
> [   47.403571] RDX: ffff880ff1dc3000 RSI: ffff880fe6f51d00 RDI:
> 0000000000000000
> [   47.411561] RBP: ffff88103751f828 R08: 0000000000020c80 R09:
> ffffffff81871e04
> [   47.419563] R10: ffffea003f9bd400 R11: ffff88100116de00 R12:
> 000000000000003e
> [   47.427566] R13: ffff880fe6f51d00 R14: ffff8810240d0090 R15:
> ffff8810240d0068
> [   47.435557] FS:  00007fd79b882dc0(0000) GS:ffff88103ee40000(0000)
> knlGS:0000000000000000
> [   47.444625] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   47.451062] CR2: 00000000000001e8 CR3: 0000001cf86c5000 CR4:
> 00000000001406e0
> [   47.459053] Stack:
> [   47.461306]  ffffffff81875480 ffff880fe6f50c00 ffff881d02f9b800
> ffff88103751f838
> [   47.469647]  ffffffff81a08415 ffff88103751f818 ffff880fe6f51d00
> 000000000000003e
> [   47.477964]  ffff881d02f9bd00 ffff8810240d0090 ffff8810240d0068
> ffff88103751f838
> [   47.486279] Call Trace:
> [   47.489019]  [<ffffffff81875480>] ? consume_skb+0x80/0x150
> [   47.495178]  [<ffffffff81a08415>] ? packet_rcv+0x65/0x6d0
> [   47.501244]  [<ffffffffc03299ae>] mlx5e_xmit+0x2e/0x40 [mlx5_core]
> [   47.508169]  [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
> [   47.515007]  [<ffffffff818951bb>] ? validate_xmit_skb.isra.80+0x4b/0x4e0
> [   47.522516]  [<ffffffff818d036f>] sch_direct_xmit+0x19f/0x360
> [   47.528963]  [<ffffffff81896565>] __dev_queue_xmit+0x6e5/0xaa0
> [   47.535502]  [<ffffffff81875480>] ? consume_skb+0x80/0x150
> [   47.542723]  [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
> [   47.549856]  [<ffffffffc08d1d54>]
> vlan_dev_hard_start_xmit+0x104/0x210 [8021q]
> [   47.558933]  [<ffffffff818959d4>] dev_hard_start_xmit+0x384/0x650
> [   47.566738]  [<ffffffff8189675a>] __dev_queue_xmit+0x8da/0xaa0
> [   47.574246]  [<ffffffff81896958>] dev_queue_xmit+0x18/0x30
> [   47.581349]  [<ffffffff818a2d07>] neigh_connected_output+0x107/0x170
> [   47.589433]  [<ffffffff819a3e9f>] ip6_finish_output2+0x23f/0x720
> [   47.597128]  [<ffffffff81430f32>] ? selinux_ipv6_postroute+0x22/0x30
> [   47.605207]  [<ffffffff819a666b>] ip6_finish_output+0x13b/0x1e0
> [   47.612809]  [<ffffffff819a6777>] ip6_output+0x67/0x1c0
> [   47.619619]  [<ffffffff819a6530>] ? ip6_fragment+0xd80/0xd80
> [   47.626903]  [<ffffffff819fb80d>] ip6_local_out+0x4d/0x60
> [   47.633884]  [<ffffffff819a703b>] ip6_send_skb+0x2b/0xb0
> [   47.640773]  [<ffffffff819a713d>] ip6_push_pending_frames+0x7d/0x90
> [   47.648710]  [<ffffffff819d533d>] rawv6_sendmsg+0xd2d/0x1210
> [   47.655938]  [<ffffffff8128f70a>] ? do_wp_page+0x3ba/0x910
> [   47.662944]  [<ffffffff8142a970>] ? sock_has_perm+0x80/0xb0
> [   47.670020]  [<ffffffff8194f2c7>] inet_sendmsg+0x97/0xf0
> [   47.676778]  [<ffffffff818673f8>] sock_sendmsg+0x58/0x90
> [   47.683505]  [<ffffffff81868148>] SYSC_sendto+0x138/0x1b0
> [   47.690302]  [<ffffffff8109d5a8>] ? __do_page_fault+0x338/0x9d0
> [   47.697656]  [<ffffffff8116b131>] ? ktime_get_with_offset+0x71/0x130
> [   47.705481]  [<ffffffff81163ee7>] ? posix_get_boottime+0x37/0x60
> [   47.712904]  [<ffffffff81868b36>] SyS_sendto+0x16/0x20
> [   47.719346]  [<ffffffff81a336b2>] entry_SYSCALL_64_fastpath+0x1a/0xa4
> [   47.727230] Code: 05 a9 9f 03 00 01 66 31 47 48 5d c3 0f 1f 00 0f 1f
> 44 00 00 55 48 89 e5 41 57 41 56 41 55 49 89 f5 41 54 53 48 89 fb 48 83
> ec 30 <0f> b7 87 e8 01 00 00 0f b6 8f ea 01 00 00 45 8b 95 80 00 00 00
> [   47.750336] RIP  [<ffffffffc0328b9c>] mlx5e_sq_xmit+0x1c/0xd80
> [mlx5_core]
> [   47.758755]  RSP <ffff88103751f7d0>
> [   47.763368] CR2: 00000000000001e8
> [   47.767779] ---[ end trace 35565b04ca44e521 ]---
>
> It appears to be intermittent as this machine has booted this kernel
> multiple times without hitting this.  Network setup includes both vlan
> and non-vlan interfaces.  If you need more info from me, please include
> me on the Cc: as I don't follow netdev@
>

Hi Doug,

Did you by chance configure TC queues for the netdev, i.e. dev->num_tc > 1?
If not, I would be happy to get more info on your network configuration.


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-05 16:42 ` Saeed Mahameed
@ 2016-05-05 17:16   ` Doug Ledford
  2016-05-05 20:51     ` Saeed Mahameed
  0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-05 17:16 UTC
  To: Saeed Mahameed; +Cc: Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 5654 bytes --]

On 05/05/2016 12:42 PM, Saeed Mahameed wrote:
> On Thu, May 5, 2016 at 7:00 PM, Doug Ledford <dledford@redhat.com> wrote:
>> Just had this pop up during testing, happened very soon after bootup:
>>

[ snip oops ]

> Hi Doug,
> 
> Did you by chance configure TC queues for the netdev, i.e. dev->num_tc > 1?
> If not, I would be happy to get more info on your network configuration.

That depends on which interface actually generated the oops.  If it was
the base interface, then I don't manually set any special params on it.
If it's one of the vlan interfaces, then there is a NetworkManager
dispatcher script that is intended to set the tc count on interface up:

[root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
::::::::::::::
/etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
::::::::::::::
#!/bin/sh
interface=$1
status=$2
[ "$interface" = mlx5_roce.43 ] || exit 0
case $status in
up)
	tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
	# tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
	;;
esac
::::::::::::::
/etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
::::::::::::::
#!/bin/sh
interface=$1
status=$2
[ "$interface" = mlx5_roce.45 ] || exit 0
case $status in
up)
	tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5
	# tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
	;;
esac
[root@rdma-virt-03 ~]$


However, I should note that this usage of tc is a bit out of date last I
checked and doesn't even work any more.  Let me double check...

[root@rdma-virt-02 vlan]$ cd /proc/net/vlan/
[root@rdma-virt-02 vlan]$ ls
config  mlx5_roce.43  mlx5_roce.45
[root@rdma-virt-02 vlan]$
[root@rdma-virt-02 vlan]$ for i in *; do echo "$i:"; cat $i; echo; done
config:
VLAN Dev name	 | VLAN ID
Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
mlx5_roce.45   | 45  | mlx5_roce
mlx5_roce.43   | 43  | mlx5_roce

mlx5_roce.43:
mlx5_roce.43  VID: 43	 REORDER_HDR: 1  dev->priv_flags: 1001
         total frames received           57
          total bytes received         5010
      Broadcast/Multicast Rcvd            0

      total frames transmitted           20
       total bytes transmitted         2525
Device: mlx5_roce
INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
 EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3

mlx5_roce.45:
mlx5_roce.45  VID: 45	 REORDER_HDR: 1  dev->priv_flags: 1001
         total frames received           57
          total bytes received         5010
      Broadcast/Multicast Rcvd            0

      total frames transmitted           21
       total bytes transmitted         2603
Device: mlx5_roce
INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
 EGRESS priority mappings: 0:5 1:5 2:5 3:5 4:5 5:5 6:5 7:5

OK, so the vlans have egress mappings, but they don't match what the
mlx5_roce.43 egress.conf file should have enabled.  Digging a little
further on this machine:

[root@rdma-virt-03 vlan]$ more
/etc/sysconfig/network-scripts/ifcfg-mlx5_roce.4?
::::::::::::::
/etc/sysconfig/network-scripts/ifcfg-mlx5_roce.43
::::::::::::::
DEVICE=mlx5_roce.43
VLAN=yes
VLAN_ID=43
VLAN_EGRESS_PRIORITY_MAP=0:3,1:3,2:3,3:3,4:3,5:3,6:3,7:3
TYPE=Vlan
ONBOOT=yes
BOOTPROTO=dhcp
DEFROUTE=no
PEERDNS=no
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=no
IPV6_PEERDNS=no
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=mlx5_roce.43
::::::::::::::
/etc/sysconfig/network-scripts/ifcfg-mlx5_roce.45
::::::::::::::
DEVICE=mlx5_roce.45
VLAN=yes
VLAN_ID=45
VLAN_EGRESS_PRIORITY_MAP=0:5,1:5,2:5,3:5,4:5,5:5,6:5,7:5
TYPE=Vlan
ONBOOT=yes
BOOTPROTO=dhcp
DEFROUTE=no
PEERDNS=no
PEERROUTES=yes
IPV4_FAILURE_FATAL=yes
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=no
IPV6_PEERDNS=no
IPV6_PEERROUTES=yes
IPV6_FAILURE_FATAL=no
NAME=mlx5_roce.45
[root@rdma-virt-03 vlan]$

This is a Fedora rawhide machine, using NetworkManager to handle the
network interfaces.  So, the egress priority mappings are being set by
NM.  I don't know if they are overriding the egress mapping dispatchers
or if the egress mapping dispatchers are failing to work/run properly.
It might be the latter.  Let me double check the command...

OK, re-reading the egress dispatchers above, they work on the base
interface, not on the vlan interface that triggers them.  That's why
they both use the same command (mapping to egress 5) instead of being
like the ifcfg files, which map the 43 vlan to egress priority 3, and
the 45 vlan to egress priority 5.  Running tc qdisc | grep mlx5_roce
shows that the egress mapping is being applied (although I'm not sure it
should be...I made that mapping many kernels ago when that was the right
thing to do, the modern mlx5 ethernet drivers create their own mappings
that are drastically different).

So, to answer your question, yes, num_tc > 1, num_tc == 8, and I
probably need to reconfigure that egress dispatcher to do what I want it
to do (which is merely to make sure that all packets from specific
interfaces are tagged with specific vlan priorities so per-priority flow
control between the card and switch works properly, the base interface
is supposed to have no priority tag, the 43 vlan is supposed to have
priority tag 3, and vlan 45 is supposed to have priority tag 5) on
modern kernels.
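
To spell out the tagging I am after, here is a minimal sketch in C of
what the two ifcfg egress maps above are meant to do: look up the skb
priority in a per-vlan table and get back the vlan priority to put in
the tag.  The struct and function names are hypothetical, and the
fallback of 0 for priorities outside 0..7 is an assumption of this
sketch, not something taken from the vlan code.

```c
#include <stdint.h>

/* Hypothetical model of one vlan interface's egress priority map. */
struct vlan_egress_cfg {
	uint16_t vid;            /* vlan ID */
	uint8_t egress_map[8];   /* skb priority 0..7 -> vlan priority */
};

/* VLAN_EGRESS_PRIORITY_MAP=0:3,1:3,...,7:3 from ifcfg-mlx5_roce.43 */
const struct vlan_egress_cfg vlan43 = { 43, { 3, 3, 3, 3, 3, 3, 3, 3 } };

/* VLAN_EGRESS_PRIORITY_MAP=0:5,1:5,...,7:5 from ifcfg-mlx5_roce.45 */
const struct vlan_egress_cfg vlan45 = { 45, { 5, 5, 5, 5, 5, 5, 5, 5 } };

/* Return the vlan priority for a packet leaving through this vlan. */
unsigned egress_pcp(const struct vlan_egress_cfg *cfg, uint32_t skb_priority)
{
	if (skb_priority > 7)
		return 0;        /* assumed default for unmapped priorities */
	return cfg->egress_map[skb_priority];
}
```

With these tables every mapped priority on vlan 43 comes out as 3 and
every mapped priority on vlan 45 comes out as 5, which is the per-vlan
tagging described above.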

-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-05 17:16   ` Doug Ledford
@ 2016-05-05 20:51     ` Saeed Mahameed
  2016-05-12 17:28       ` Doug Ledford
  0 siblings, 1 reply; 7+ messages in thread
From: Saeed Mahameed @ 2016-05-05 20:51 UTC
  To: Doug Ledford; +Cc: Linux Netdev List

On Thu, May 5, 2016 at 8:16 PM, Doug Ledford <dledford@redhat.com> wrote:
>
> That depends on which interface actually generated the oops.  If it was
> the base interface, then I don't manually set any special params on it.
> If it's one of the vlan interfaces, then there is a NetworkManager
> dispatcher script that is intended to set the tc count on interface up:
>
> [root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
> ::::::::::::::
> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
> ::::::::::::::
> #!/bin/sh
> interface=$1
> status=$2
> [ "$interface" = mlx5_roce.43 ] || exit 0
> case $status in
> up)
>         tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
> 5 5 5 5 5 5

Well, here you are configuring 8 TCs on the base mlx5 interface, so
the answer to my question is yes.

It appears that we have a bug in mlx5e_select_queue:

int channel_ix = fallback(dev, skb);
return priv->channeltc_to_txq_map[channel_ix][tc];

When num_tc > 1, the fallback can return any value in the range
[0 .. num_channels * num_tc],

while channeltc_to_txq_map is an array of size num_channels.

So there is a good chance that channel_ix exceeds the array bounds,
resulting in an oops.
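
The overflow can be reproduced in a small userspace sketch.  Everything
here is an assumption for illustration: NUM_CHANNELS, the txq numbering
in init_map, and the modulo fold in select_txq are stand-ins, not the
driver's actual code or its eventual fix.

```c
#include <assert.h>

#define NUM_CHANNELS 4  /* assumed channel count, for illustration only */
#define NUM_TC       8  /* matches "num_tc 8" in the dispatcher script */

/* Stand-in for priv->channeltc_to_txq_map: one row per channel. */
static int channeltc_to_txq_map[NUM_CHANNELS][NUM_TC];

/* Fill the map with an assumed numbering: txq = channel + tc * NUM_CHANNELS. */
void init_map(void)
{
	for (int ch = 0; ch < NUM_CHANNELS; ch++)
		for (int tc = 0; tc < NUM_TC; tc++)
			channeltc_to_txq_map[ch][tc] = ch + tc * NUM_CHANNELS;
}

/*
 * channel_ix stands for the value returned by fallback().  With num_tc > 1
 * it can be larger than NUM_CHANNELS, so indexing the map with it directly
 * would read past the last row.  Folding it back into range is one possible
 * guard (hypothetical, not necessarily what the driver should do).
 */
int select_txq(int channel_ix, int tc)
{
	assert(channel_ix >= 0 && tc >= 0 && tc < NUM_TC);

	if (channel_ix >= NUM_CHANNELS)
		channel_ix %= NUM_CHANNELS;

	return channeltc_to_txq_map[channel_ix][tc];
}
```

Without the range check, a channel_ix such as 30 would index row 30 of a
4-row array, which is the kind of out-of-bounds read suspected here.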

>         # tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
>         ;;
> esac
> --More--(Next file:
> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf::::::::::::::
> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
> ::::::::::::::
> #!/bin/sh
> interface=$1
> status=$2
> [ "$interface" = mlx5_roce.45 ] || exit 0
> case $status in
> up)
>         tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
> 5 5 5 5 5 5

Well, here you map all user skb priorities (skb->priority) to HW TC 5.
BTW, the skprio (user prio) in this example is never the vlan prio; it is
the IPv4 ToS.

please see http://lartc.org/manpages/tc-prio.html
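
A minimal sketch in C of what that "map" argument amounts to: a 16-slot
table indexed by the low four bits of skb->priority, each slot naming a
TC.  prio_to_tc is a hypothetical name for this note; the table contents
are copied from the dispatcher script.

```c
#include <stdint.h>

#define TC_BITMASK 15  /* the map has 16 slots, one per masked priority */

/* "map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5": every slot points at TC 5. */
static const uint8_t prio_tc_map[TC_BITMASK + 1] = {
	5, 5, 5, 5, 5, 5, 5, 5,
	5, 5, 5, 5, 5, 5, 5, 5,
};

/* Look up the traffic class for a given skb priority. */
unsigned prio_to_tc(uint32_t skb_priority)
{
	return prio_tc_map[skb_priority & TC_BITMASK];
}
```

With this table every skb priority lands on TC 5, regardless of which
vlan interface the packet came from.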

So to achieve a vlan prio to HW tc mapping, you will need to map the
skprios to vlan prios using vlan egress mapping
which i see you already do down below.

But our select queue implementation will extract the vlan priority
and use the corresponding TC from our own
priv->channeltc_to_txq_map[channel_ix][up] mapping,
where up is the vlan user priority.  But this only applies to kernel
traffic; I don't see why it is needed for RoCE.

Currently this code is buggy and I will need to dig more into how to
provide a full working solution that fits our hardware requirements
and complies with the kernel QoS APIs.
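
For reference, the vlan user priority sits in the top three bits of the
16-bit vlan TCI, above the DEI bit and the 12-bit vlan ID.  A minimal
sketch in C, assuming the standard 802.1Q layout; make_tci and tci_to_up
are hypothetical helper names for this note, not driver functions.

```c
#include <stdint.h>

#define VLAN_PRIO_SHIFT 13      /* priority occupies bits 15..13 of the TCI */
#define VLAN_VID_MASK   0x0fff  /* vlan ID occupies bits 11..0 */

/* Build a TCI from a user priority (0..7) and a vlan ID; DEI is left at 0. */
uint16_t make_tci(unsigned up, unsigned vid)
{
	return (uint16_t)(((up & 7u) << VLAN_PRIO_SHIFT) | (vid & VLAN_VID_MASK));
}

/* Extract the user priority ("up") from a TCI. */
unsigned tci_to_up(uint16_t tci)
{
	return tci >> VLAN_PRIO_SHIFT;
}
```

For vlan 43 with priority 3 the TCI is 0x602b, and for vlan 45 with
priority 5 it is 0xa02d; shifting either right by 13 recovers the
priority that would be used as the second index into the map.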


[...]

> [root@rdma-virt-02 vlan]$ for i in *; do echo "$i:"; cat $i; echo; done
> config:
> VLAN Dev name    | VLAN ID
> Name-Type: VLAN_NAME_TYPE_RAW_PLUS_VID_NO_PAD
> mlx5_roce.45   | 45  | mlx5_roce
> mlx5_roce.43   | 43  | mlx5_roce
>
> mlx5_roce.43:
> mlx5_roce.43  VID: 43    REORDER_HDR: 1  dev->priv_flags: 1001
>          total frames received           57
>           total bytes received         5010
>       Broadcast/Multicast Rcvd            0
>
>       total frames transmitted           20
>        total bytes transmitted         2525
> Device: mlx5_roce
> INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
>  EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3
>

Here you map every SKB prio (0..7) to vlan priority 3.

>
> mlx5_roce.45:
> mlx5_roce.45  VID: 45    REORDER_HDR: 1  dev->priv_flags: 1001
>          total frames received           57
>           total bytes received         5010
>       Broadcast/Multicast Rcvd            0
>
>       total frames transmitted           21
>        total bytes transmitted         2603
> Device: mlx5_roce
> INGRESS priority mappings: 0:0  1:0  2:0  3:0  4:0  5:0  6:0 7:0
>  EGRESS priority mappings: 0:5 1:5 2:5 3:5 4:5 5:5 6:5 7:5
>
> OK, so the vlans have egress mappings, but they don't match what the
> mlx5_roce.43 egress.conf file should have enabled.  Digging a little
> further on this machine:
>
> [root@rdma-virt-03 vlan]$ more
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.4?
> ::::::::::::::
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.43
> ::::::::::::::
> DEVICE=mlx5_roce.43
> VLAN=yes
> VLAN_ID=43
> VLAN_EGRESS_PRIORITY_MAP=0:3,1:3,2:3,3:3,4:3,5:3,6:3,7:3
> TYPE=Vlan
> ONBOOT=yes
> BOOTPROTO=dhcp
> DEFROUTE=no
> PEERDNS=no
> PEERROUTES=yes
> IPV4_FAILURE_FATAL=yes
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=no
> IPV6_PEERDNS=no
> IPV6_PEERROUTES=yes
> IPV6_FAILURE_FATAL=no
> NAME=mlx5_roce.43
> ::::::::::::::
> /etc/sysconfig/network-scripts/ifcfg-mlx5_roce.45
> ::::::::::::::
> DEVICE=mlx5_roce.45
> VLAN=yes
> VLAN_ID=45
> VLAN_EGRESS_PRIORITY_MAP=0:5,1:5,2:5,3:5,4:5,5:5,6:5,7:5
> TYPE=Vlan
> ONBOOT=yes
> BOOTPROTO=dhcp
> DEFROUTE=no
> PEERDNS=no
> PEERROUTES=yes
> IPV4_FAILURE_FATAL=yes
> IPV6INIT=yes
> IPV6_AUTOCONF=yes
> IPV6_DEFROUTE=no
> IPV6_PEERDNS=no
> IPV6_PEERROUTES=yes
> IPV6_FAILURE_FATAL=no
> NAME=mlx5_roce.45
> [root@rdma-virt-03 vlan]$
>
> This is a Fedora rawhide machine, using NetworkManager to handle the
> network interfaces.  So, the egress priority mappings are being set by
> NM.  I don't know if they are overriding the egress mapping dispatchers
> or if the egress mapping dispatchers are failing to work/run properly.
> It might be the latter.  Let me double check the command...
>
> OK, re-reading the egress dispatchers above, they work on the base
> interface, not on the vlan interface that triggers them.  That's why
> they both use the same command (mapping to egress 5) instead of being
> like the ifcfg files, which map the 43 vlan to egress priority 3, and
> the 45 vlan to egress priority 5.  Running tc qdisc | grep mlx5_roce
> shows that the egress mapping is being applied (although I'm not sure it
> should be...I made that mapping many kernels ago when that was the right
> thing to do, the modern mlx5 ethernet drivers create their own mappings
> that are drastically different).
>
> So, to answer your question, yes, num_tc > 1, num_tc == 8, and I
> probably need to reconfigure that egress dispatcher to do what I want it
> to do (which is merely to make sure that all packets from specific
> interfaces are tagged with specific vlan priorities so per-priority flow
> control between the card and switch works properly, the base interface
> is supposed to have no priority tag, the 43 vlan is supposed to have
> priority tag 3, and vlan 45 is supposed to have priority tag 5) on
> modern kernels.
>

As I said above, configuring any num_tc > 1 might cause the panic you saw.

Regarding the proper mapping for vlan 45 => priority 5 and vlan 43 =>
priority 3: the egress mappings you already did above should be
sufficient.  The question is, do you need the vlan priorities to be
mapped to specific HW TC dispatchers?

If not, then you don't need to configure "tc qdisc add dev mlx5_roce
root ..." at all.


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-05 20:51     ` Saeed Mahameed
@ 2016-05-12 17:28       ` Doug Ledford
  2016-05-19 17:13         ` Eran Ben Elisha
  0 siblings, 1 reply; 7+ messages in thread
From: Doug Ledford @ 2016-05-12 17:28 UTC
  To: Saeed Mahameed; +Cc: Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 5334 bytes --]

On 05/05/2016 04:51 PM, Saeed Mahameed wrote:
> On Thu, May 5, 2016 at 8:16 PM, Doug Ledford <dledford@redhat.com> wrote:
>>
>> That depends on which interface actually generated the oops.  If it was
>> the base interface, then I don't manually set any special params on it.
>> If it's one of the vlan interfaces, then there is a NetworkManager
>> dispatcher script that is intended to set the tc count on interface up:
>>
>> [root@rdma-virt-03 ~]$ more /etc/NetworkManager/dispatcher.d/98-mlx5_roce.4*
>> ::::::::::::::
>> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.43-egress.conf
>> ::::::::::::::
>> #!/bin/sh
>> interface=$1
>> status=$2
>> [ "$interface" = mlx5_roce.43 ] || exit 0
>> case $status in
>> up)
>>         tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
>> 5 5 5 5 5 5
> 
> Well, here you are configuring 8 TCs on the base mlx5 interface, so
> the answer to my question is yes.

Correct.  I mentioned that at the end of my email ;-)

> It appears that we have a bug in mlx5e_select_queue:
>
> int channel_ix = fallback(dev, skb);
> return priv->channeltc_to_txq_map[channel_ix][tc];
>
> When num_tc > 1, the fallback can return any value in the range
> [0 .. num_channels * num_tc],
>
> while channeltc_to_txq_map is an array of size num_channels.
>
> So there is a good chance that channel_ix exceeds the array bounds,
> resulting in an oops.
> 
>>         # tc_wrap.py -i mlx5_roce -u 5,5,5,5,5,5,5,5,5,5,5,5,5,5,5,5
>>         ;;
>> esac
>> --More--(Next file:
>> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf::::::::::::::
>> /etc/NetworkManager/dispatcher.d/98-mlx5_roce.45-egress.conf
>> ::::::::::::::
>> #!/bin/sh
>> interface=$1
>> status=$2
>> [ "$interface" = mlx5_roce.45 ] || exit 0
>> case $status in
>> up)
>>         tc qdisc add dev mlx5_roce root mqprio num_tc 8 map 5 5 5 5 5 5 5 5 5 5
>> 5 5 5 5 5 5
> 
> Well, here you map all user skb priorities (skb->priority) to HW TC 5.
> BTW, the skprio (user prio) in this example is never the vlan prio; it is
> the IPv4 ToS.
> 
> please see http://lartc.org/manpages/tc-prio.html

Ok.

> So to achieve a vlan prio to HW tc mapping, you will need to map the
> skprios to vlan prios using vlan egress mapping
> which i see you already do down below.

I do, and this is all related to trying to get PFC working for RoCE on
these cards.  For the most part, the things you see here are documented
in the Mellanox guides related to RoCE setup, or they are things I
pulled from the tc_wrap.py program that you guys distribute for setting
this stuff up.

> But, our select queue implementation will extract the vlan priority
> and use the corresponding TC from our own
> priv->channeltc_to_txq_map[channel_ix][up] mapping
> where up is vlan user priority.  but this only applies to kernel
> traffic, i don't see why it is needed for RoCE.

Read your own guides ;-).

I'm using this one for your switches:
https://community.mellanox.com/docs/DOC-1417

And these to try and get the linux machines configured properly:
https://community.mellanox.com/docs/DOC-1414
https://community.mellanox.com/docs/DOC-1415
https://community.mellanox.com/docs/DOC-2311
https://community.mellanox.com/docs/DOC-2474
http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf

The guides are helpful if your setup allows you to follow their exact
example.  But, they are shy on information about how to modify the
examples to your specific situation.  For instance, I have to use vlan
priority 5 as my no-drop priority for RoCE traffic.  I can't reliably
tell which portions of the guide I must switch the 3s to 5s in order to
get the new priority, and which uses of 3s in the guides relate to other
things that could be mapped to 5.  On a separate note, it's unclear to
me if your switches and cards support more than one no-drop priority
(other vendor's RoCE cards I'm using here don't, they only allow one
no-drop priority for RoCE traffic and it must be 5).  If it does support
more than one, I'd actually like both 3 and 5 to be no-drop and for one
vlan to use 3 and another to use 5.

> As i said above configuring any num_tc > 1 might cause the panic you saw.
> 
> Regarding the proper mapping to do for 45 => priority 5, 43 => prio 3.
> the egress mappings you already did above should be sufficient, the
> question is, do you need the vlan priorities to be mapped to a
> specific HW TC dispatchers ?

You'd have to tell me.  The switch docs make it clear that it's best if
no-drop priorities are mapped to TC1 or TC2 (which is not necessarily
the same as the TC mapping you refer to here as far as I know, but it
might be similar).  The doc on setting up ConnectX-4 cards talks about
the same basic TC dispatchers on the card, but instead of 4 like the
switches have, there are 8.  So, does the card's built in
firmware/silicon have a preference for where no-drop traffic is queued
via TC dispatches like the switches do?

> 
> if not, then you don't need to configure  "tc qdisc add dev mlx5_roce
> root ..." at all.

That appears to be a question for Mellanox to answer.  I can't say.


-- 
Doug Ledford <dledford@redhat.com>
              GPG KeyID: 0E572FDD



[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 884 bytes --]


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-12 17:28       ` Doug Ledford
@ 2016-05-19 17:13         ` Eran Ben Elisha
  2016-06-08 12:48           ` Doug Ledford
  0 siblings, 1 reply; 7+ messages in thread
From: Eran Ben Elisha @ 2016-05-19 17:13 UTC
  To: Doug Ledford; +Cc: Saeed Mahameed, Linux Netdev List, ophirm, Eran Ben Elisha

Hi Doug,
Attaching here a response from Ophir Maor (from Mellanox community)

>
> Read your own guides ;-).
>
> I'm using this one for your switches:
> https://community.mellanox.com/docs/DOC-1417
>
> And these to try and get the linux machines configured properly:
> https://community.mellanox.com/docs/DOC-1414
> https://community.mellanox.com/docs/DOC-1415
> https://community.mellanox.com/docs/DOC-2311
> https://community.mellanox.com/docs/DOC-2474
> http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
>
> The guides are helpful if your setup allows you to follow their exact
> example.  But, they are shy on information about how to modify the
> examples to your specific situation.  For instance, I have to use vlan
> priority 5 as my no-drop priority for RoCE traffic.  I can't reliably
> tell which portions of the guide I must switch the 3s to 5s in order to
> get the new priority, and which uses of 3s in the guides relate to other
> things that could be mapped to 5.  On a separate note, it's unclear to
> me if your switches and cards support more than one no-drop priority
> (other vendor's RoCE cards I'm using here don't, they only allow one
> no-drop priority for RoCE traffic and it must be 5).  If it does support
> more than one, I'd actually like both 3 and 5 to be no-drop and for one
> vlan to use 3 and another to use 5.

There are two flows for configuring egress mapping:

- Flows that pass through the kernel.  For these you need to use kernel
commands (e.g. vconfig set_egress_map, or other commands) to make the
kernel set the egress priority.

- Flows that bypass the kernel, such as RoCE.  For these you need to use
tc_wrap to set the egress mapping.

This post explains it very nicely for ConnectX-4.

https://community.mellanox.com/docs/DOC-2474

Thanks,

Ophir.


* Re: mlx5 core/en oops in 4.6-rc6+
  2016-05-19 17:13         ` Eran Ben Elisha
@ 2016-06-08 12:48           ` Doug Ledford
  0 siblings, 0 replies; 7+ messages in thread
From: Doug Ledford @ 2016-06-08 12:48 UTC (permalink / raw)
  To: Eran Ben Elisha
  Cc: Saeed Mahameed, Linux Netdev List, ophirm, Eran Ben Elisha


On 5/19/2016 1:13 PM, Eran Ben Elisha wrote:
> Hi Doug,
> Attaching here a response from Ophir Maor (from Mellanox community)

This conversation is a low-priority, spare-time thread for me, so it can
take me a while to respond ;-)

>>
>> Read your own guides ;-).
>>
>> I'm using this one for your switches:
>> https://community.mellanox.com/docs/DOC-1417
>>
>> And these to try and get the linux machines configured properly:
>> https://community.mellanox.com/docs/DOC-1414
>> https://community.mellanox.com/docs/DOC-1415
>> https://community.mellanox.com/docs/DOC-2311
>> https://community.mellanox.com/docs/DOC-2474
>> http://www.mellanox.com/related-docs/prod_software/RoCE_with_Priority_Flow_Control_Application_Guide.pdf
>>
>> The guides are helpful if your setup allows you to follow their exact
>> example.  But, they are shy on information about how to modify the
>> examples to your specific situation.  For instance, I have to use vlan
>> priority 5 as my no-drop priority for RoCE traffic.  I can't reliably
>> tell which portions of the guide I must switch the 3s to 5s in order to
>> get the new priority, and which uses of 3s in the guides relate to other
>> things that could be mapped to 5.  On a separate note, it's unclear to
>> me if your switches and cards support more than one no-drop priority
>> (other vendor's RoCE cards I'm using here don't, they only allow one
>> no-drop priority for RoCE traffic and it must be 5).  If it does support
>> more than one, I'd actually like both 3 and 5 to be no-drop and for one
>> vlan to use 3 and another to use 5.
> 
> There are two flows to configure egress mapping
> 
> - flow that pass via the kernel. Then you need to use kernel commands
> (e.g. vconfig set_egress_map, or other commands) to make the kernel
> set the egress priority.

Yes.  Done.  Which actually has nothing to do with RoCE (I don't think
even kernel RoCE flows go through this since they don't use the kernel
net stack but use the card's firmware and RoCE work requests to send
data) and is just part of the Mellanox recommended "put all traffic on
this vlan on this priority even if it isn't all RoCE".  I'm not sure I
agree with it, and explanations that specifically exclude it to make
things clearer would be nice.

> - flows that bypass the kernel such as RoCE, then you need to use
> tc_wrap to set the egress mapping.

tc_wrap is not an explanation, nor really a suitable answer to "how do I
do this" as it's out of date for the current upstream kernels last I
checked...

> This post explains it very nicely for ConnectX-4.
> 
> https://community.mellanox.com/docs/DOC-2474

Yes, I read this post, and I downloaded tc_wrap from Mellanox, and I
dissected tc_wrap to figure out it was doing what I added to my
dispatcher file, namely this:

tc qdisc add dev mlx4_roce root mqprio num_tc 8 \
    map 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 \
    queues 32@0 32@32 32@64 32@96 32@128 32@160 32@192 32@224

But even though I was able to pull that out of tc_wrap, the explanation
of the mechanism is completely missing: how does setting what appears to
be a kernel queue discipline cause packets that the kernel never sees,
and that the card handles entirely, to be sent with a specific
priority?  What is the chain here?  Does setting
the queue discipline here translate to a setting on the card and there
is some magic in that setting that triggers the firmware to do the right
thing on RoCE packets?  Does the driver read the queue disc when setting
up address handles to use on the work requests and get the information
that way?  How is this information actually making it to the packet
generation engine in the firmware?  And given how recent upstream
kernels have changed the default queue discipline on these cards, it is
unclear how this command might need to be modified to keep working.
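
For anyone adapting the mqprio command above to a different priority, a
minimal sketch of how its arguments decompose, going by the tc-mqprio
argument layout: "map" lists one traffic class for each of the 16 skb
priorities, and "queues" gives one count@offset pair per traffic class.
The helper name and its parameters are hypothetical, and it only prints
the command. It does not answer the question of how the setting reaches
RoCE traffic.

```shell
#!/bin/sh
# Build (and print, not run) an mqprio command that sends all 16 skb
# priorities to one traffic class and gives every traffic class an
# equal, contiguous block of queues.
# Arguments: device, traffic class, number of classes, queues per class.
print_mqprio() {
    dev=$1
    tc_idx=$2
    num_tc=$3
    per_tc=$4
    map=""
    i=0
    while [ "$i" -lt 16 ]; do
        map="$map $tc_idx"
        i=$((i + 1))
    done
    queues=""
    t=0
    while [ "$t" -lt "$num_tc" ]; do
        # count@offset: each class starts where the previous one ended
        queues="$queues $per_tc@$((t * per_tc))"
        t=$((t + 1))
    done
    echo "tc qdisc add dev $dev root mqprio num_tc $num_tc map$map queues$queues"
}

print_mqprio mlx4_roce 5 8 32
```

With the arguments shown, the printed line matches the command pulled
out of tc_wrap; changing the second argument is the only edit needed to
target a different traffic class.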




Thread overview: 7+ messages
2016-05-05 16:00 mlx5 core/en oops in 4.6-rc6+ Doug Ledford
2016-05-05 16:42 ` Saeed Mahameed
2016-05-05 17:16   ` Doug Ledford
2016-05-05 20:51     ` Saeed Mahameed
2016-05-12 17:28       ` Doug Ledford
2016-05-19 17:13         ` Eran Ben Elisha
2016-06-08 12:48           ` Doug Ledford
