* umad_send with service level higher than 0 does not work
@ 2012-12-14 12:18 Jens Domke
  2012-12-14 13:47 ` Hal Rosenstock
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-14 12:18 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA; +Cc: Torsten Hoefler

Hello,

I'm trying to find a bug in our configuration which causes the IB fabric, or at least the port where the OpenSM is running, to crash. I hope someone on this list has more experience and can help, or give me a hint.

The configuration:
  a) HCAs: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE]; or Voltaire (ibv_devinfo shows board_id: VLT0130010001, fw_ver: 2.3.000)
  b) OFED 3.5 rc2
  c) OpenSM with DFSSSP routing algorithm running on a compute node (additional OpenSM on a switch with lower priority)
  d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
  e) kernel 2.6.32-220.13.1.el6.x86_64

As far as I understand the whole system:
  1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
  2. the SA receives the request on QP1
  3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
  4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c

The osm_vendor_send() function builds the MAD packet with the following attributes:
        /* GS classes */
        umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
                          p_mad_addr->addr_type.gsi.remote_qp,
                          p_mad_addr->addr_type.gsi.service_level,
                          IB_QP1_WELL_KNOWN_Q_KEY);
So the SL is the same as the one used by the OMPI process. The Q_Key matches the Q_Key on the OMPI process, and remote_qp and dest_lid are correct, too.
Afterwards, umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
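For reference, stripped down to the libibumad calls, the response path looks roughly like this (my own minimal sketch, not the actual OpenSM code; the helper name is made up, port/agent setup and error handling are omitted):

/*
 * Minimal sketch of the libibumad calls behind osm_vendor_send() for a
 * GSI (SA) response.  Assumes the port was opened with umad_open_port()
 * and the agent registered with umad_register().
 */
#include <string.h>
#include <arpa/inet.h>
#include <infiniband/umad.h>

#define QP1_WELL_KNOWN_Q_KEY 0x80010000U

static int send_sa_response(int portid, int agentid,
                            const void *resp_mad, size_t mad_len,
                            int dest_lid_net, int remote_qp_net, int sl)
{
        void *umad = umad_alloc(1, umad_size() + mad_len);
        int rc;

        if (!umad)
                return -1;

        /* copy the already built SA MAD into the umad payload */
        memcpy(umad_get_mad(umad), resp_mad, mad_len);

        /* dest LID and remote QP are passed in network byte order, exactly
         * as in libvendor/osm_vendor_ibumad.c; sl is the SL the request
         * came in on -- in my setup only sl == 0 ever reaches the requester */
        umad_set_addr_net(umad, dest_lid_net, remote_qp_net, sl,
                          htonl(QP1_WELL_KNOWN_Q_KEY));

        /* timeout 0: this is a response, no reply is expected */
        rc = umad_send(portid, agentid, umad, mad_len, 0, 0);

        umad_free(umad);
        return rc;
}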

If I look into the MAD before it is sent, it looks like this:
Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
    at src/umad.c:791
791             if (umaddebug > 1)
(gdb) p *mad
$1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
    lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
    hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
    pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
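
(For reading this dump: the addr fields of the ib_user_mad header are stored in network byte order, so the raw numbers gdb prints on this little-endian host have to be byte-swapped. A tiny sketch of how I decode them; the interpretation of the values is mine:)

/* decode the network-byte-order addr fields from the gdb dump above
 * (values as seen on a little-endian x86_64 host) */
#include <stdio.h>
#include <arpa/inet.h>

int main(void)
{
        printf("qkey = 0x%08x\n", ntohl(384));        /* 0x80010000, the QP1 well-known Q_Key */
        printf("lid  = %u\n", (unsigned)ntohs(4096)); /* 16, presumably the requester's LID */
        printf("qpn  = 0x%06x\n", ntohl(1325427712)); /* source QP of the requesting process */
        return 0;
}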

After a short time, the kernel writes the following messages into the log:
Dec 14 01:23:46 rc001 kernel: INFO: task opensm:2499 blocked for more than 120 seconds.
Dec 14 01:23:46 rc001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 14 01:23:46 rc001 kernel: opensm        D 0000000000000000     0  2499   2498 0x00000000
Dec 14 01:23:46 rc001 kernel: ffff880424bebc38 0000000000000082 0000000000000000 0000000000000000
Dec 14 01:23:46 rc001 kernel: 0000000000000000 ffff8804ffffffff ffff88042287eec0 0000000031bc502d
Dec 14 01:23:46 rc001 kernel: ffff880427fca678 ffff880424bebfd8 000000000000f4e8 ffff880427fca678
Dec 14 01:23:46 rc001 kernel: Call Trace:
Dec 14 01:23:46 rc001 kernel: [<ffffffff814eddc5>] schedule_timeout+0x215/0x2e0
Dec 14 01:23:46 rc001 kernel: [<ffffffff8109698f>] ? up+0x2f/0x50
Dec 14 01:23:46 rc001 kernel: [<ffffffffa00fb8d2>] ? __mlx4_cmd+0x202/0x300 [mlx4_core]
Dec 14 01:23:46 rc001 kernel: [<ffffffff814eda43>] wait_for_common+0x123/0x180
Dec 14 01:23:46 rc001 kernel: [<ffffffff8105e940>] ? default_wake_function+0x0/0x20
Dec 14 01:23:46 rc001 kernel: [<ffffffff814edb5d>] wait_for_completion+0x1d/0x20
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0e1913a>] ib_unregister_mad_agent+0x33a/0x500 [ib_mad]
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9f923>] ib_umad_unreg_agent+0xb3/0xe0 [ib_umad]
Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9fa37>] ib_umad_ioctl+0x67/0x70 [ib_umad]
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189582>] vfs_ioctl+0x22/0xa0
Dec 14 01:23:46 rc001 kernel: [<ffffffff81141190>] ? unmap_region+0x110/0x130
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189724>] do_vfs_ioctl+0x84/0x580
Dec 14 01:23:46 rc001 kernel: [<ffffffff8113f33e>] ? remove_vma+0x6e/0x90
Dec 14 01:23:46 rc001 kernel: [<ffffffff81141828>] ? do_munmap+0x308/0x3a0
Dec 14 01:23:46 rc001 kernel: [<ffffffff81189ca1>] sys_ioctl+0x81/0xa0
Dec 14 01:23:46 rc001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
(Even "modprobe mlx4_core enable_qos=Y debug_level=1" does not make any difference and I get the same output like the one above)

Neither the OpenMPI output nor OpenSM's log file shows any useful information about this problem, even at higher debug levels.
The OpenSM does not really respond to Ctrl+C and becomes a zombie process afterwards, so the only option is to reboot the node.

So, right now I'm stuck, and I have no idea whether there is an error in the kernel driver, the HCA firmware, or something else entirely, or whether umad_send simply does not support SL>0.
A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.

Please let me know if you need more information, or if I can test something to give you more insight.

Thank you in advance,
Jens

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------



* Re: umad_send with service level higher than 0 does not work
  2012-12-14 12:18 umad_send with service level higher than 0 does not work Jens Domke
@ 2012-12-14 13:47 ` Hal Rosenstock
       [not found]   ` <50CB2DF3.7020409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-14 13:47 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

On 12/14/2012 7:18 AM, Jens Domke wrote:
> Hello,
> 
> I'm trying to find a bug in our configuration, which causes the the IB fabric or at least the port where the OpenSM is running to crash. I hope someone on this list has more experience and can help, or give me a hint.
> 
> The configuration:
>   a) HCAs: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE]; or Voltaire (ibv_devinfo shows board_id: VLT0130010001, fw_ver: 2.3.000)
>   b) OFED 3.5 rc2
>   c) OpenSM with DFSSSP routing algorithm running on a compute node (additinal OpenSM on a switch with lower priority)

Not related to this problem but it is problematic to mix SM flavors like
this in a subnet.

>   d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"

I'm not familiar with what DFSSSP does to figure out SLs exactly but
there should be no need to set this. The proper SL for querying the SA
for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
(and other QoS based routing algorithms), it calculates that and the SM
pushes this into each port. That should be used. It's possible that SL1
is not a valid SL for port <-> SA querying using DFSSSP.
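
For what it's worth, a requester can read that SMSL straight out of its local PortInfo; a minimal libibverbs sketch (my example, not OMPI code; first device, port 1, no error handling):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);   /* first HCA */
        struct ibv_port_attr pattr;

        if (ibv_query_port(ctx, 1, &pattr))                   /* port 1 */
                return 1;

        /* sm_sl is the PortInfo.SMSL the SM pushed into this port;
         * sm_lid is where SA queries should be sent */
        printf("sm_lid = %u, sm_sl = %u\n", pattr.sm_lid, pattr.sm_sl);

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}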

>   e) kernel 2.6.32-220.13.1.el6.x86_64
> 
> As far as I understand the whole system:
>   1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>   2. the SA receives the request on QP1

There is the SL in the query itself. This should be the SMSL that the SM
set for that port.

>   3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path

This is a (potentially) different SL (for MPI<->MPI port communication)
than the one the query used and is the one returned inside the
PathRecord attribute/data.

>   4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c

By the response reversibility rule, I think this is returned on the SL
of the original query but haven't verified this in the code base yet.

> The osm_vendor_send() function builds the MAD packet with the following attributes:
>         /* GS classes */
>         umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>                           p_mad_addr->addr_type.gsi.remote_qp,
>                           p_mad_addr->addr_type.gsi.service_level,
>                           IB_QP1_WELL_KNOWN_Q_KEY);
> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).

By not working, what do you mean ? Do you mean it's not received at the
requester with no message in the OpenSM log or not received at the
OpenSM or something else ? It could be due to the wrong SL being used in
the original request (forcing it to SL 1). That could cause it not to be
received at the SM or the response not to make it back to the requester
from the SA if the SL used is not "reversible".

> If I look into the MAD before it is send, then it looks like this:
> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>     at src/umad.c:791
> 791             if (umaddebug > 1)
> (gdb) p *mad
> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>     lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>     hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>     pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}

Is this the PathRecord query on the OpenMPI side or the response on the
OpenSM side ? SL is 6 rather than 1 here.

> The kernel writes the following messages after a short time into the log:
> Dec 14 01:23:46 rc001 kernel: INFO: task opensm:2499 blocked for more than 120 seconds.
> Dec 14 01:23:46 rc001 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Dec 14 01:23:46 rc001 kernel: opensm        D 0000000000000000     0  2499   2498 0x00000000
> Dec 14 01:23:46 rc001 kernel: ffff880424bebc38 0000000000000082 0000000000000000 0000000000000000
> Dec 14 01:23:46 rc001 kernel: 0000000000000000 ffff8804ffffffff ffff88042287eec0 0000000031bc502d
> Dec 14 01:23:46 rc001 kernel: ffff880427fca678 ffff880424bebfd8 000000000000f4e8 ffff880427fca678
> Dec 14 01:23:46 rc001 kernel: Call Trace:
> Dec 14 01:23:46 rc001 kernel: [<ffffffff814eddc5>] schedule_timeout+0x215/0x2e0
> Dec 14 01:23:46 rc001 kernel: [<ffffffff8109698f>] ? up+0x2f/0x50
> Dec 14 01:23:46 rc001 kernel: [<ffffffffa00fb8d2>] ? __mlx4_cmd+0x202/0x300 [mlx4_core]
> Dec 14 01:23:46 rc001 kernel: [<ffffffff814eda43>] wait_for_common+0x123/0x180
> Dec 14 01:23:46 rc001 kernel: [<ffffffff8105e940>] ? default_wake_function+0x0/0x20
> Dec 14 01:23:46 rc001 kernel: [<ffffffff814edb5d>] wait_for_completion+0x1d/0x20
> Dec 14 01:23:46 rc001 kernel: [<ffffffffa0e1913a>] ib_unregister_mad_agent+0x33a/0x500 [ib_mad]
> Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9f923>] ib_umad_unreg_agent+0xb3/0xe0 [ib_umad]
> Dec 14 01:23:46 rc001 kernel: [<ffffffffa0d9fa37>] ib_umad_ioctl+0x67/0x70 [ib_umad]
> Dec 14 01:23:46 rc001 kernel: [<ffffffff81189582>] vfs_ioctl+0x22/0xa0
> Dec 14 01:23:46 rc001 kernel: [<ffffffff81141190>] ? unmap_region+0x110/0x130
> Dec 14 01:23:46 rc001 kernel: [<ffffffff81189724>] do_vfs_ioctl+0x84/0x580
> Dec 14 01:23:46 rc001 kernel: [<ffffffff8113f33e>] ? remove_vma+0x6e/0x90
> Dec 14 01:23:46 rc001 kernel: [<ffffffff81141828>] ? do_munmap+0x308/0x3a0
> Dec 14 01:23:46 rc001 kernel: [<ffffffff81189ca1>] sys_ioctl+0x81/0xa0
> Dec 14 01:23:46 rc001 kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
> (Even "modprobe mlx4_core enable_qos=Y debug_level=1" does not make any difference and I get the same output like the one above)

This looks like the problem reported on the list where there are
outstanding work completions and some MAD client is trying to exit. The
root cause for that has yet to be determined AFAIK.

> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.

So nothing interesting logged relative to the PathRecord queries ?

> The OpenSM does not really respond to ctrl+c and becomes a zombi process afterwards, so that the only option is to reboot the node.

Right, after the above error, I wouldn't expect OpenSM to be able to
exit cleanly.

> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.

So SL 0 works between all nodes and SA for querying/responses. Wonder if
that's how SMSL is set by DFSSSP.

-- Hal

> Please let me know if you need more information, or if I can test something to give you more inside.
> 
> Thank you in advance,
> Jens
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 



* Re: umad_send with service level higher than 0 does not work
       [not found]   ` <50CB2DF3.7020409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-14 15:17     ` Jens Domke
  2012-12-14 16:42       ` Hal Rosenstock
  2012-12-14 18:17     ` Ira Weiny
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-14 15:17 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

thank you for the fast response. I will try to clarify some points.

>>  d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
> 
> I'm not familiar with what DFSSSP does to figure out SLs exactly but
> there should be no need to set this. The proper SL for querying the SA
> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
> (and other QoS based routing algorithms), it calculates that and the SM
> pushes this into each port. That should be used. It's possible that SL1
> is not a valid SL for port <-> SA querying using DFSSSP.
The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
> 
>>  e) kernel 2.6.32-220.13.1.el6.x86_64
>> 
>> As far as I understand the whole system:
>>  1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>  2. the SA receives the request on QP1
> 
> There is the SL in the query itself. This should be the SMSL that the SM
> set for that port.
Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
In fact OpenMPI sets everything to 0 except for slid and dlid.
> 
>>  3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
> 
> This is a (potentially) different SL (for MPI<->MPI port communication)
> than the one the query used and is the one returned inside the
> PathRecord attribute/data.
Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
> 
>>  4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
> 
> By the response reversibility rule, I think this is returned on the SL
> of the original query but haven't verified this in the code base yet.
Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
> 
>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>        /* GS classes */
>>        umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>                          p_mad_addr->addr_type.gsi.remote_qp,
>>                          p_mad_addr->addr_type.gsi.service_level,
>>                          IB_QP1_WELL_KNOWN_Q_KEY);
>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
> 
> By not working, what do you mean ? Do you mean it's not received at the
> requester with no message in the OpenSM log or not received at the
> OpenSM or something else ? It could be due to the wrong SL being used in
> the original request (forcing it to SL 1). That could cause it not to be
> received at the SM or the response not to make it back to the requester
> from the SA if the SL used is not "reversible".
By "not working" I mean, that the MPI process does not receive any response from the SA.
I get messages from the MPI process like the following:
[rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
And I think I saw some messages in the log about "…1 outstanding MAD…".
> 
>> If I look into the MAD before it is send, then it looks like this:
>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>    at src/umad.c:791
>> 791             if (umaddebug > 1)
>> (gdb) p *mad
>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>    lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>    hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>    pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
> 
> Is this the PathRecord query on the OpenMPI side or the response on the
> OpenSM side ? SL is 6 rather than 1 here.
This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …)).
SL=6 indicates that the MPI process sent the request on SL 6.
> 
>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
> 
> So nothing interesting logged relative to the PathRecord queries ?
In the OpenSM log, only that the request was received, what it looked like, and that the response was sent back.
And a few "outstanding MADs" messages a few lines later in the log.
> 
>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
> 
> So SL 0 works between all nodes and SA for querying/responses. Wonder if
> that's how SMSL is set by DFSSSP.
No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used for MPI<->MPI traffic, to ensure deadlock freedom.

Regards
Jens

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------



* Re: umad_send with service level higher than 0 does not work
  2012-12-14 15:17     ` Jens Domke
@ 2012-12-14 16:42       ` Hal Rosenstock
       [not found]         ` <50CB56E9.70900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-14 16:42 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi again,

On 12/14/2012 10:17 AM, Jens Domke wrote:
> Hello Hal,
> 
> thank you for the fast response. I will try to clarify some points.
> 
>>>  d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>
>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>> there should be no need to set this. The proper SL for querying the SA
>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>> (and other QoS based routing algorithms), it calculates that and the SM
>> pushes this into each port. That should be used. It's possible that SL1
>> is not a valid SL for port <-> SA querying using DFSSSP.
> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>
>>>  e) kernel 2.6.32-220.13.1.el6.x86_64
>>>
>>> As far as I understand the whole system:
>>>  1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>  2. the SA receives the request on QP1
>>
>> There is the SL in the query itself. This should be the SMSL that the SM
>> set for that port.
> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>
>>>  3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>
>> This is a (potentially) different SL (for MPI<->MPI port communication)
>> than the one the query used and is the one returned inside the
>> PathRecord attribute/data.
> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.

With DFSSSP, are all SLs the same from a source port to any destination ?

>>
>>>  4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>
>> By the response reversibility rule, I think this is returned on the SL
>> of the original query but haven't verified this in the code base yet.
> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.

I double checked and indeed the SA response does use the SL that the
incoming request was received on.

>>
>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>        /* GS classes */
>>>        umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>                          p_mad_addr->addr_type.gsi.remote_qp,
>>>                          p_mad_addr->addr_type.gsi.service_level,
>>>                          IB_QP1_WELL_KNOWN_Q_KEY);
>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>
>> By not working, what do you mean ? Do you mean it's not received at the
>> requester with no message in the OpenSM log or not received at the
>> OpenSM or something else ? It could be due to the wrong SL being used in
>> the original request (forcing it to SL 1). That could cause it not to be
>> received at the SM or the response not to make it back to the requester
>> from the SA if the SL used is not "reversible".
> By "not working" I mean, that the MPI process does not receive any response from the SA.
> I get messages from the MPI process like the following:
> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
> And I think I was some messages in the log about "…1 outstanding MAD…".
>>
>>> If I look into the MAD before it is send, then it looks like this:
>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>    at src/umad.c:791
>>> 791             if (umaddebug > 1)
>>> (gdb) p *mad
>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>    lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>    hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>    pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>
>> Is this the PathRecord query on the OpenMPI side or the response on the
>> OpenSM side ? SL is 6 rather than 1 here.
> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
> SL=6 indicates, that the MPI process was sending the request on SL 6.

What is SMSL for the requester ? Was it SL 6 ?

One would need to walk the SLToVLMappingTables from requester (OMPI
port) to SA and back to see whether SL6 would even have a chance of
working (not dropping) aside from whether it's really the correct SL to use.

-- Hal

>>
>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>
>> So nothing interesting logged relative to the PathRecord queries ?
> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
> And a few "outstanding MADs" a few lines later in the log.
>>
>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>
>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>> that's how SMSL is set by DFSSSP.
> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
> for MPI<->MPI traffic, to ensure deadlock freedom.
> 
> Regards
> Jens
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 



* Re: umad_send with service level higher than 0 does not work
       [not found]   ` <50CB2DF3.7020409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2012-12-14 15:17     ` Jens Domke
@ 2012-12-14 18:17     ` Ira Weiny
  1 sibling, 0 replies; 18+ messages in thread
From: Ira Weiny @ 2012-12-14 18:17 UTC (permalink / raw)
  To: Hal Rosenstock
  Cc: Jens Domke, linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

On Fri, 14 Dec 2012 08:47:31 -0500
Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:

> On 12/14/2012 7:18 AM, Jens Domke wrote:
> > Hello,

[snip]

> 
> >   d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
> 
> I'm not familiar with what DFSSSP does to figure out SLs exactly but
> there should be no need to set this.

DFSSSP requires QoS to be configured in OpenSM; "... to equally spread the load on the available SL or virtual lanes."
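
For example (illustrative values only, not a tested recommendation), the relevant opensm.conf knobs would be something like:

# enable QoS setup (same effect as running opensm with -Q)
qos TRUE
# 8 data VLs so that SL0-7 can map 1:1 onto VL0-7
qos_max_vls 8
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7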

Ira

[snip]


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org


* Re: umad_send with service level higher than 0 does not work
       [not found]         ` <50CB56E9.70900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-14 18:24           ` Jens Domke
  2012-12-14 18:58             ` Hal Rosenstock
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-14 18:24 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:

> Hi again,
> 
> On 12/14/2012 10:17 AM, Jens Domke wrote:
>> Hello Hal,
>> 
>> thank you for the fast response. I will try to clarify some points.
>> 
>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>> 
>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>> there should be no need to set this. The proper SL for querying the SA
>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>> (and other QoS based routing algorithms), it calculates that and the SM
>>> pushes this into each port. That should be used. It's possible that SL1
>>> is not a valid SL for port <-> SA querying using DFSSSP.
>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>> 
>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>> 
>>>> As far as I understand the whole system:
>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>> 2. the SA receives the request on QP1
>>> 
>>> There is the SL in the query itself. This should be the SMSL that the SM
>>> set for that port.
>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>> 
>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>> 
>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>> than the one the query used and is the one returned inside the
>>> PathRecord attribute/data.
>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
> 
> With DFSSSP are all SLs same from source port to get to any destination ?
No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
> 
>>> 
>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>> 
>>> By the response reversibility rule, I think this is returned on the SL
>>> of the original query but haven't verified this in the code base yet.
>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
> 
> I doubled checked and indeed the SA response does use the SL that the
> incoming request was received on.
> 
>>> 
>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>       /* GS classes */
>>>>       umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>                         p_mad_addr->addr_type.gsi.remote_qp,
>>>>                         p_mad_addr->addr_type.gsi.service_level,
>>>>                         IB_QP1_WELL_KNOWN_Q_KEY);
>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>> 
>>> By not working, what do you mean ? Do you mean it's not received at the
>>> requester with no message in the OpenSM log or not received at the
>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>> the original request (forcing it to SL 1). That could cause it not to be
>>> received at the SM or the response not to make it back to the requester
>>> from the SA if the SL used is not "reversible".
>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>> I get messages from the MPI process like the following:
>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>> 
>>>> If I look into the MAD before it is send, then it looks like this:
>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>   at src/umad.c:791
>>>> 791             if (umaddebug > 1)
>>>> (gdb) p *mad
>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>   lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>   hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>   pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>> 
>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>> OpenSM side ? SL is 6 rather than 1 here.
>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>> SL=6 indicates, that the MPI process was sending the request on SL 6.
> 
> What is SMSL for the requester ? Was it SL 6 ?
Yes, it was SL 6.
Here is the content of a similar packet that was received by the SA. I used ibdump on the port where the OpenSM was running:
======================================================================================
No.     Time        Source                Destination           Protocol Length Info
    785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)

Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
    Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
    Epoch Time: 1355389784.437633332 seconds
    [Time delta from previous captured frame: 4.332020528 seconds]
    [Time delta from previous displayed frame: 4.332020528 seconds]
    [Time since reference or first frame: 14.352168681 seconds]
    Frame Number: 785
    Frame Length: 290 bytes (2320 bits)
    Capture Length: 290 bytes (2320 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: erf:infiniband]
Extensible Record Format
    [ERF Header]
        Timestamp: 0x50c99b587008bcf2
        [Header type]
            .001 0101 = type: INFINIBAND (21)
            0... .... = Extension header present: 0
        0000 0100 = flags: 4
            .... ..00 = capture interface: 0
            .... .1.. = varying record length: 1
            .... 0... = truncated: 0
            ...0 .... = rx error: 0
            ..0. .... = ds error: 0
            00.. .... = reserved: 0
        record length: 306
        loss counter: 0
        wire length: 290
InfiniBand
    Local Route Header
        0110 .... = Virtual Lane: 0x06
        .... 0000 = Link Version: 0
        0110 .... = Service Level: 6
        .... 00.. = Reserved (2 bits): 0
        .... ..10 = Link Next Header: 0x02
        Destination Local ID: 19
        0000 0... .... .... = Reserved (5 bits): 0
        .... .000 0100 1000 = Packet Length: 72
        Source Local ID: 16
    Base Transport Header
        Opcode: 100
        1... .... = Solicited Event: True
        .1.. .... = MigReq: True
        ..00 .... = Pad Count: 0
        .... 0000 = Header Version: 0
        Partition Key: 65535
        Reserved (8 bits): 0
        Destination Queue Pair: 0x000001
        0... .... = Acknowledge Request: False
        .000 0000 = Reserved (7 bits): 0
        Packet Sequence Number: 0
    DETH - Datagram Extended Transport Header
        Queue Key: 2147549184
        Reserved (8 bits): 0
        Source Queue Pair: 0x00380050
    MAD Header - Common Management Datagram
        Base Version: 0x01
        Management Class: 0x03
        Class Version: 0x02
        Method: Get() (0x01)
        Status: 0x0000
        Class Specific: 0x0000
        Transaction ID: 0x0010000f38005000
        Attribute ID: 0x0035
        Reserved: 0x0000
        Attribute Modifier: 0x00000000
        MAD Data Payload: 000000000000000000000000000000000000000000000000...
     Illegal RMPP Type (0)! 
        RMPP Type: 0x00
        RMPP Type: 0x00
        0000 .... = R Resp Time: 0x00
        .... 0000 = RMPP Flags: Unknown (0x00)
        RMPP Status:  (Normal) (0x00)
        RMPP Data 1: 0x00000000
        RMPP Data 2: 0x00000000
    SMASubnAdmGet(PathRecord)
        SM_Key (Verification Key): 0x0000000000000000
        Attribute Offset: 0x0000
        Reserved: 0x0000
        Component Mask: 0x0000003000000000
        Attribute (PathRecord)
            PathRecord
                DGID: :: (::)
                SGID: ::0.15.0.16 (::0.15.0.16)
                DLID: 0x0000
                SLID: 0x0000
                0... .... = RawTraffic: 0x00
                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
                HopLimit: 0x00
                TClass: 0x00
                0... .... = Reversible: 0x00
                .000 0000 = NumbPath: 0x00
                P_Key: 0x0000
                .... .... .... 0000 = SL: 0x0000
                00.. .... = MTUSelector: 0x00
                ..00 0000 = MTU: 0x00
                00.. .... = RateSelector: 0x00
                ..00 0000 = Rate: 0x00
                00.. .... = PacketLifeTimeSelector: 0x00
                ..00 0000 = PacketLifeTime: 0x00
                Preference: 0x00
    Variant CRC: 0xad4e
======================================================================================
> 
> One would need to walk the SLToVLMappingTables from requester (OMPI
> port) to SA and back to see whether SL6 would even have a chance of
> working (not dropping) aside from whether it's really the correct SL to use.
All SL2VL tables look the same. I checked the output of OpenSM.
	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
But this is also as expected, because I have set the QoS in the opensm config as follows:
	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports"; they remained: qos_[sw0 | rtr]_sl2vl (null)

Regards
Jens

> 
> -- Hal
> 
>>> 
>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>> 
>>> So nothing interesting logged relative to the PathRecord queries ?
>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>> And a few "outstanding MADs" a few lines later in the log.
>>> 
>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>> 
>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>> that's how SMSL is set by DFSSSP.
>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>> for MPI<->MPI traffic, to ensure deadlock freedom.
>> 
>> Regards
>> Jens
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>> 
>> 
> 

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------



* Re: umad_send with service level higher than 0 does not work
  2012-12-14 18:24           ` Jens Domke
@ 2012-12-14 18:58             ` Hal Rosenstock
       [not found]               ` <50CB76F2.70003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-14 18:58 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi,

On 12/14/2012 1:24 PM, Jens Domke wrote:
> Hello Hal,
> 
> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
> 
>> Hi again,
>>
>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>> Hello Hal,
>>>
>>> thank you for the fast response. I will try to clarify some points.
>>>
>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>
>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>> there should be no need to set this. The proper SL for querying the SA
>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>> pushes this into each port. That should be used. It's possible that SL1
>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>
>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>
>>>>> As far as I understand the whole system:
>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>> 2. the SA receives the request on QP1
>>>>
>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>> set for that port.
>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>
>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>
>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>> than the one the query used and is the one returned inside the
>>>> PathRecord attribute/data.
>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>
>> With DFSSSP are all SLs same from source port to get to any destination ?
> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).

If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.

>>
>>>>
>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>
>>>> By the response reversibility rule, I think this is returned on the SL
>>>> of the original query but haven't verified this in the code base yet.
>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>
>> I doubled checked and indeed the SA response does use the SL that the
>> incoming request was received on.
>>
>>>>
>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>       /* GS classes */
>>>>>       umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>                         p_mad_addr->addr_type.gsi.remote_qp,
>>>>>                         p_mad_addr->addr_type.gsi.service_level,
>>>>>                         IB_QP1_WELL_KNOWN_Q_KEY);
>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>
>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>> requester with no message in the OpenSM log or not received at the
>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>> received at the SM or the response not to make it back to the requester
>>>> from the SA if the SL used is not "reversible".
>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>> I get messages from the MPI process like the following:
>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>
>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>   at src/umad.c:791
>>>>> 791             if (umaddebug > 1)
>>>>> (gdb) p *mad
>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>   lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>   hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>   pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>
>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>> OpenSM side ? SL is 6 rather than 1 here.
>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>
>> What is SMSL for the requester ? Was it SL 6 ?
> Yes, it was SL 6.
> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
> ======================================================================================
> No.     Time        Source                Destination           Protocol Length Info
>     785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
> 
> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>     Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>     Epoch Time: 1355389784.437633332 seconds
>     [Time delta from previous captured frame: 4.332020528 seconds]
>     [Time delta from previous displayed frame: 4.332020528 seconds]
>     [Time since reference or first frame: 14.352168681 seconds]
>     Frame Number: 785
>     Frame Length: 290 bytes (2320 bits)
>     Capture Length: 290 bytes (2320 bits)
>     [Frame is marked: False]
>     [Frame is ignored: False]
>     [Protocols in frame: erf:infiniband]
> Extensible Record Format
>     [ERF Header]
>         Timestamp: 0x50c99b587008bcf2
>         [Header type]
>             .001 0101 = type: INFINIBAND (21)
>             0... .... = Extension header present: 0
>         0000 0100 = flags: 4
>             .... ..00 = capture interface: 0
>             .... .1.. = varying record length: 1
>             .... 0... = truncated: 0
>             ...0 .... = rx error: 0
>             ..0. .... = ds error: 0
>             00.. .... = reserved: 0
>         record length: 306
>         loss counter: 0
>         wire length: 290
> InfiniBand
>     Local Route Header
>         0110 .... = Virtual Lane: 0x06
>         .... 0000 = Link Version: 0
>         0110 .... = Service Level: 6
>         .... 00.. = Reserved (2 bits): 0
>         .... ..10 = Link Next Header: 0x02
>         Destination Local ID: 19
>         0000 0... .... .... = Reserved (5 bits): 0
>         .... .000 0100 1000 = Packet Length: 72
>         Source Local ID: 16
>     Base Transport Header
>         Opcode: 100
>         1... .... = Solicited Event: True
>         .1.. .... = MigReq: True
>         ..00 .... = Pad Count: 0
>         .... 0000 = Header Version: 0
>         Partition Key: 65535
>         Reserved (8 bits): 0
>         Destination Queue Pair: 0x000001
>         0... .... = Acknowledge Request: False
>         .000 0000 = Reserved (7 bits): 0
>         Packet Sequence Number: 0
>     DETH - Datagram Extended Transport Header
>         Queue Key: 2147549184
>         Reserved (8 bits): 0
>         Source Queue Pair: 0x00380050
>     MAD Header - Common Management Datagram
>         Base Version: 0x01
>         Management Class: 0x03
>         Class Version: 0x02
>         Method: Get() (0x01)
>         Status: 0x0000
>         Class Specific: 0x0000
>         Transaction ID: 0x0010000f38005000
>         Attribute ID: 0x0035
>         Reserved: 0x0000
>         Attribute Modifier: 0x00000000
>         MAD Data Payload: 000000000000000000000000000000000000000000000000...
>      Illegal RMPP Type (0)! 
>         RMPP Type: 0x00
>         RMPP Type: 0x00
>         0000 .... = R Resp Time: 0x00
>         .... 0000 = RMPP Flags: Unknown (0x00)
>         RMPP Status:  (Normal) (0x00)
>         RMPP Data 1: 0x00000000
>         RMPP Data 2: 0x00000000
>     SMASubnAdmGet(PathRecord)
>         SM_Key (Verification Key): 0x0000000000000000
>         Attribute Offset: 0x0000
>         Reserved: 0x0000
>         Component Mask: 0x0000003000000000
>         Attribute (PathRecord)
>             PathRecord
>                 DGID: :: (::)
>                 SGID: ::0.15.0.16 (::0.15.0.16)
>                 DLID: 0x0000
>                 SLID: 0x0000
>                 0... .... = RawTraffic: 0x00
>                 .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>                 HopLimit: 0x00
>                 TClass: 0x00
>                 0... .... = Reversible: 0x00
>                 .000 0000 = NumbPath: 0x00
>                 P_Key: 0x0000
>                 .... .... .... 0000 = SL: 0x0000
>                 00.. .... = MTUSelector: 0x00
>                 ..00 0000 = MTU: 0x00
>                 00.. .... = RateSelector: 0x00
>                 ..00 0000 = Rate: 0x00
>                 00.. .... = PacketLifeTimeSelector: 0x00
>                 ..00 0000 = PacketLifeTime: 0x00
>                 Preference: 0x00
>     Variant CRC: 0xad4e
> ======================================================================================

And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
out of that machine and the issue is internal to that machine. It could be
because of the underlying issue which hangs OpenSM when some IB program
tried to unregister from the MAD layer but there were outstanding work
completions. That's based on your original email earlier this AM.

>>
>> One would need to walk the SLToVLMappingTables from requester (OMPI
>> port) to SA and back to see whether SL6 would even have a chance of
>> working (not dropping) aside from whether it's really the correct SL to use.
> All SL2VL tables look the same. I checked the output of OpenSM.
> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
> But this is also as expected, because I have set the QoS in the opensm config as follows:
> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)

That works as long as all links have (at least) 8 data VLs (VLCap 4).
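
A rough host-side sanity check for that with libibverbs (my sketch, not from this thread; ibv_port_attr.max_vl_num carries the raw VLCap encoding that ibv_devinfo also prints, so a value of 4 should correspond to 8 data VLs; switch ports would need an SMP query instead):

#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
        static const int vls[] = { 0, 1, 2, 4, 8, 15 };
        struct ibv_device **devs = ibv_get_device_list(NULL);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_port_attr pattr;

        if (ibv_query_port(ctx, 1, &pattr) || pattr.max_vl_num > 5)
                return 1;

        printf("VLCap encoding %u -> %d data VLs\n",
               pattr.max_vl_num, vls[pattr.max_vl_num]);

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}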

-- Hal

> Regards
> Jens
> 
>>
>> -- Hal
>>
>>>>
>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>
>>>> So nothing interesting logged relative to the PathRecord queries ?
>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>> And a few "outstanding MADs" a few lines later in the log.
>>>>
>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>
>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>> that's how SMSL is set by DFSSSP.
>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>
>>> Regards
>>> Jens
>>>
>>> --------------------------------
>>> Dipl.-Math. Jens Domke
>>> Researcher - Tokyo Institute of Technology
>>> Satoshi MATSUOKA Laboratory
>>> Global Scientific Information and Computing Center
>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>> Tokyo, 152-8550, JAPAN
>>> Tel/Fax: +81-3-5734-3876
>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>> --------------------------------
>>>
>>>
>>
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]               ` <50CB76F2.70003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-14 20:32                 ` Jens Domke
  2012-12-14 20:44                   ` Hal Rosenstock
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-14 20:32 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:

> Hi,
> 
> On 12/14/2012 1:24 PM, Jens Domke wrote:
>> Hello Hal,
>> 
>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>> 
>>> Hi again,
>>> 
>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>> Hello Hal,
>>>> 
>>>> thank you for the fast response. I will try to clarify some points.
>>>> 
>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>> 
>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>> 
>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>> 
>>>>>> As far as I understand the whole system:
>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>> 2. the SA receives the request on QP1
>>>>> 
>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>> set for that port.
>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>> 
>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>> 
>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>> than the one the query used and is the one returned inside the
>>>>> PathRecord attribute/data.
>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>> 
>>> With DFSSSP are all SLs same from source port to get to any destination ?
>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
> 
> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
True. But I don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
So the SA could use any valid SL, even if DFSSSP would recommend another SL.

I just read the IB spec and it says that the "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
So it is most likely that there is a mismatch between the way OMPI sets up the PathRequest and the way the SA builds the response packet.
OMPI always specifies SL=0 (let's say SL_a) inside the PathRequest packet, and sends the packet on SL_b (PortInfo.SMSL).
The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
If SL_b is not 0, then the packet can't reach the OMPI process. Right?

If I analyse this correctly, then there are two bugs. One is in OMPI: it does not specify the SL within the PathRequest in an appropriate way (which would be an SL suggested by DFSSSP for the reversible path). The second bug is that the SA uses the SL on which the PathRequest packet was sent, and not the SL specified within the packet.
What do you think?

I can try to change the PathRequest of OMPI tomorrow, so that its SL matches addr_type.gsi.service_level.
Maybe with this change the packets of the SA will reach the OMPI process on an SL>0.
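
Roughly, the OMPI-side change I have in mind would look like this (untested sketch; IB_PR_COMPMASK_SLID/DLID/SL and ib_path_rec_set_sl() are the names I find in OpenSM's ib_types.h, hopefully I read that correctly; local_lid, remote_lid and smsl are just placeholders for what OMPI already has at hand, and the actual MAD packing is left out):

        #include <string.h>
        #include <complib/cl_byteswap.h>
        #include <iba/ib_types.h>

        static void build_pr_query(uint16_t local_lid, uint16_t remote_lid,
                                   uint8_t smsl, ib_path_rec_t *pr,
                                   ib_net64_t *comp_mask)
        {
                memset(pr, 0, sizeof(*pr));
                pr->slid = cl_hton16(local_lid);   /* as today */
                pr->dlid = cl_hton16(remote_lid);  /* as today */
                ib_path_rec_set_sl(pr, smsl);      /* new: SL_a = SL_b = PortInfo.SMSL */
                *comp_mask = IB_PR_COMPMASK_SLID | IB_PR_COMPMASK_DLID |
                             IB_PR_COMPMASK_SL;    /* new: SL bit in the CompMask */
                /* the caller copies *pr and *comp_mask into the
                 * SubnAdmGet(PathRecord) MAD and sends it on SL_b as before */
        }

That would at least make the SL inside the payload consistent with the SL the request travels on; whether the SA then honours it is a different question.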
> 
>>> 
>>>>> 
>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>> 
>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>> of the original query but haven't verified this in the code base yet.
>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>> 
>>> I doubled checked and indeed the SA response does use the SL that the
>>> incoming request was received on.
>>> 
>>>>> 
>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>      /* GS classes */
>>>>>>      umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>                        p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>                        p_mad_addr->addr_type.gsi.service_level,
>>>>>>                        IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>> 
>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>> requester with no message in the OpenSM log or not received at the
>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>> received at the SM or the response not to make it back to the requester
>>>>> from the SA if the SL used is not "reversible".
>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>> I get messages from the MPI process like the following:
>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>> 
>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>  at src/umad.c:791
>>>>>> 791             if (umaddebug > 1)
>>>>>> (gdb) p *mad
>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>  lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>  hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>  pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>> 
>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>> 
>>> What is SMSL for the requester ? Was it SL 6 ?
>> Yes, it was SL 6.
>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>> ======================================================================================
>> No.     Time        Source                Destination           Protocol Length Info
>>    785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>> 
>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>    Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>    Epoch Time: 1355389784.437633332 seconds
>>    [Time delta from previous captured frame: 4.332020528 seconds]
>>    [Time delta from previous displayed frame: 4.332020528 seconds]
>>    [Time since reference or first frame: 14.352168681 seconds]
>>    Frame Number: 785
>>    Frame Length: 290 bytes (2320 bits)
>>    Capture Length: 290 bytes (2320 bits)
>>    [Frame is marked: False]
>>    [Frame is ignored: False]
>>    [Protocols in frame: erf:infiniband]
>> Extensible Record Format
>>    [ERF Header]
>>        Timestamp: 0x50c99b587008bcf2
>>        [Header type]
>>            .001 0101 = type: INFINIBAND (21)
>>            0... .... = Extension header present: 0
>>        0000 0100 = flags: 4
>>            .... ..00 = capture interface: 0
>>            .... .1.. = varying record length: 1
>>            .... 0... = truncated: 0
>>            ...0 .... = rx error: 0
>>            ..0. .... = ds error: 0
>>            00.. .... = reserved: 0
>>        record length: 306
>>        loss counter: 0
>>        wire length: 290
>> InfiniBand
>>    Local Route Header
>>        0110 .... = Virtual Lane: 0x06
>>        .... 0000 = Link Version: 0
>>        0110 .... = Service Level: 6
>>        .... 00.. = Reserved (2 bits): 0
>>        .... ..10 = Link Next Header: 0x02
>>        Destination Local ID: 19
>>        0000 0... .... .... = Reserved (5 bits): 0
>>        .... .000 0100 1000 = Packet Length: 72
>>        Source Local ID: 16
>>    Base Transport Header
>>        Opcode: 100
>>        1... .... = Solicited Event: True
>>        .1.. .... = MigReq: True
>>        ..00 .... = Pad Count: 0
>>        .... 0000 = Header Version: 0
>>        Partition Key: 65535
>>        Reserved (8 bits): 0
>>        Destination Queue Pair: 0x000001
>>        0... .... = Acknowledge Request: False
>>        .000 0000 = Reserved (7 bits): 0
>>        Packet Sequence Number: 0
>>    DETH - Datagram Extended Transport Header
>>        Queue Key: 2147549184
>>        Reserved (8 bits): 0
>>        Source Queue Pair: 0x00380050
>>    MAD Header - Common Management Datagram
>>        Base Version: 0x01
>>        Management Class: 0x03
>>        Class Version: 0x02
>>        Method: Get() (0x01)
>>        Status: 0x0000
>>        Class Specific: 0x0000
>>        Transaction ID: 0x0010000f38005000
>>        Attribute ID: 0x0035
>>        Reserved: 0x0000
>>        Attribute Modifier: 0x00000000
>>        MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>     Illegal RMPP Type (0)! 
>>        RMPP Type: 0x00
>>        RMPP Type: 0x00
>>        0000 .... = R Resp Time: 0x00
>>        .... 0000 = RMPP Flags: Unknown (0x00)
>>        RMPP Status:  (Normal) (0x00)
>>        RMPP Data 1: 0x00000000
>>        RMPP Data 2: 0x00000000
>>    SMASubnAdmGet(PathRecord)
>>        SM_Key (Verification Key): 0x0000000000000000
>>        Attribute Offset: 0x0000
>>        Reserved: 0x0000
>>        Component Mask: 0x0000003000000000
>>        Attribute (PathRecord)
>>            PathRecord
>>                DGID: :: (::)
>>                SGID: ::0.15.0.16 (::0.15.0.16)
>>                DLID: 0x0000
>>                SLID: 0x0000
>>                0... .... = RawTraffic: 0x00
>>                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>                HopLimit: 0x00
>>                TClass: 0x00
>>                0... .... = Reversible: 0x00
>>                .000 0000 = NumbPath: 0x00
>>                P_Key: 0x0000
>>                .... .... .... 0000 = SL: 0x0000
>>                00.. .... = MTUSelector: 0x00
>>                ..00 0000 = MTU: 0x00
>>                00.. .... = RateSelector: 0x00
>>                ..00 0000 = Rate: 0x00
>>                00.. .... = PacketLifeTimeSelector: 0x00
>>                ..00 0000 = PacketLifeTime: 0x00
>>                Preference: 0x00
>>    Variant CRC: 0xad4e
>> ======================================================================================
> 
> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
> out that machine and the issue is internal to that machine. It could be
> because of the underlying issue which hangs OpenSM when some IB program
> tried to unregister from the MAD layer but there were outstanding work
> completions. That's based on your original email earlier this AM.
No, the SubnAdmGetResp does not show up if I use ibdump on the OMPI side and the SA uses an SL>0.
> 
>>> 
>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>> port) to SA and back to see whether SL6 would even have a chance of
>>> working (not dropping) aside from whether it's really the correct SL to use.
>> All SL2VL tables look the same. I checked the output of OpenSM.
>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
> 
> That works as long as all links have (at least) 8 data VLs (VLCap 4).
Yes, all VL_CAP show 4 in the OpenSM log file.

Regards
Jens



> 
> -- Hal
> 
>> Regards
>> Jens
>> 
>>> 
>>> -- Hal
>>> 
>>>>> 
>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>> 
>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>> 
>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>> 
>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>> that's how SMSL is set by DFSSSP.
>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>> 
>>>> Regards
>>>> Jens
>>>> 
>>>> --------------------------------
>>>> Dipl.-Math. Jens Domke
>>>> Researcher - Tokyo Institute of Technology
>>>> Satoshi MATSUOKA Laboratory
>>>> Global Scientific Information and Computing Center
>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>> Tokyo, 152-8550, JAPAN
>>>> Tel/Fax: +81-3-5734-3876
>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>> --------------------------------
>>>> 
>>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
  2012-12-14 20:32                 ` Jens Domke
@ 2012-12-14 20:44                   ` Hal Rosenstock
       [not found]                     ` <50CB8F90.1030701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-14 20:44 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi,

On 12/14/2012 3:32 PM, Jens Domke wrote:
> Hello Hal,
> 
> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
> 
>> Hi,
>>
>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>> Hello Hal,
>>>
>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>
>>>> Hi again,
>>>>
>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>> Hello Hal,
>>>>>
>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>
>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>
>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>
>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>
>>>>>>> As far as I understand the whole system:
>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>> 2. the SA receives the request on QP1
>>>>>>
>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>> set for that port.
>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>
>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>
>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>> than the one the query used and is the one returned inside the
>>>>>> PathRecord attribute/data.
>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>
>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>
>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
> 
> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 

So CompMask in the query has the SL bit on and SL is set to 0 inside the
SubnAdmGet of PathRecord ?

> and sends the packet on SL_b (PortInfo.SMSL).

Good.

> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
> If SL_b is not 0, then the packet can't reach the OMPI process. Right?

Depends. It may be that both SLs work but maybe not.

> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
> What do you think?

Yes, it might be better to wildcard the SL in the query. The only
scenario that would fail with the query you are making is if there's no SL
0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
If that's the case, SA should return MAD status 0xc (status code 3 -
ERR_NO_RECORDS). But the response doesn't make it back to the requester
OMPI node, so it's not even getting that far.

> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>
>>>>
>>>>>>
>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>
>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>> of the original query but haven't verified this in the code base yet.
>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>
>>>> I doubled checked and indeed the SA response does use the SL that the
>>>> incoming request was received on.
>>>>
>>>>>>
>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>      /* GS classes */
>>>>>>>      umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>                        p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>                        p_mad_addr->addr_type.gsi.service_level,
>>>>>>>                        IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>
>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>> received at the SM or the response not to make it back to the requester
>>>>>> from the SA if the SL used is not "reversible".
>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>> I get messages from the MPI process like the following:
>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>
>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>  at src/umad.c:791
>>>>>>> 791             if (umaddebug > 1)
>>>>>>> (gdb) p *mad
>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>  lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>  hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>  pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>
>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>
>>>> What is SMSL for the requester ? Was it SL 6 ?
>>> Yes, it was SL 6.
>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>> ======================================================================================
>>> No.     Time        Source                Destination           Protocol Length Info
>>>    785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>
>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>    Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>    Epoch Time: 1355389784.437633332 seconds
>>>    [Time delta from previous captured frame: 4.332020528 seconds]
>>>    [Time delta from previous displayed frame: 4.332020528 seconds]
>>>    [Time since reference or first frame: 14.352168681 seconds]
>>>    Frame Number: 785
>>>    Frame Length: 290 bytes (2320 bits)
>>>    Capture Length: 290 bytes (2320 bits)
>>>    [Frame is marked: False]
>>>    [Frame is ignored: False]
>>>    [Protocols in frame: erf:infiniband]
>>> Extensible Record Format
>>>    [ERF Header]
>>>        Timestamp: 0x50c99b587008bcf2
>>>        [Header type]
>>>            .001 0101 = type: INFINIBAND (21)
>>>            0... .... = Extension header present: 0
>>>        0000 0100 = flags: 4
>>>            .... ..00 = capture interface: 0
>>>            .... .1.. = varying record length: 1
>>>            .... 0... = truncated: 0
>>>            ...0 .... = rx error: 0
>>>            ..0. .... = ds error: 0
>>>            00.. .... = reserved: 0
>>>        record length: 306
>>>        loss counter: 0
>>>        wire length: 290
>>> InfiniBand
>>>    Local Route Header
>>>        0110 .... = Virtual Lane: 0x06
>>>        .... 0000 = Link Version: 0
>>>        0110 .... = Service Level: 6
>>>        .... 00.. = Reserved (2 bits): 0
>>>        .... ..10 = Link Next Header: 0x02
>>>        Destination Local ID: 19
>>>        0000 0... .... .... = Reserved (5 bits): 0
>>>        .... .000 0100 1000 = Packet Length: 72
>>>        Source Local ID: 16
>>>    Base Transport Header
>>>        Opcode: 100
>>>        1... .... = Solicited Event: True
>>>        .1.. .... = MigReq: True
>>>        ..00 .... = Pad Count: 0
>>>        .... 0000 = Header Version: 0
>>>        Partition Key: 65535
>>>        Reserved (8 bits): 0
>>>        Destination Queue Pair: 0x000001
>>>        0... .... = Acknowledge Request: False
>>>        .000 0000 = Reserved (7 bits): 0
>>>        Packet Sequence Number: 0
>>>    DETH - Datagram Extended Transport Header
>>>        Queue Key: 2147549184
>>>        Reserved (8 bits): 0
>>>        Source Queue Pair: 0x00380050
>>>    MAD Header - Common Management Datagram
>>>        Base Version: 0x01
>>>        Management Class: 0x03
>>>        Class Version: 0x02
>>>        Method: Get() (0x01)
>>>        Status: 0x0000
>>>        Class Specific: 0x0000
>>>        Transaction ID: 0x0010000f38005000
>>>        Attribute ID: 0x0035
>>>        Reserved: 0x0000
>>>        Attribute Modifier: 0x00000000
>>>        MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>     Illegal RMPP Type (0)! 
>>>        RMPP Type: 0x00
>>>        RMPP Type: 0x00
>>>        0000 .... = R Resp Time: 0x00
>>>        .... 0000 = RMPP Flags: Unknown (0x00)
>>>        RMPP Status:  (Normal) (0x00)
>>>        RMPP Data 1: 0x00000000
>>>        RMPP Data 2: 0x00000000
>>>    SMASubnAdmGet(PathRecord)
>>>        SM_Key (Verification Key): 0x0000000000000000
>>>        Attribute Offset: 0x0000
>>>        Reserved: 0x0000
>>>        Component Mask: 0x0000003000000000
>>>        Attribute (PathRecord)
>>>            PathRecord
>>>                DGID: :: (::)
>>>                SGID: ::0.15.0.16 (::0.15.0.16)
>>>                DLID: 0x0000
>>>                SLID: 0x0000
>>>                0... .... = RawTraffic: 0x00
>>>                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>                HopLimit: 0x00
>>>                TClass: 0x00
>>>                0... .... = Reversible: 0x00
>>>                .000 0000 = NumbPath: 0x00
>>>                P_Key: 0x0000
>>>                .... .... .... 0000 = SL: 0x0000
>>>                00.. .... = MTUSelector: 0x00
>>>                ..00 0000 = MTU: 0x00
>>>                00.. .... = RateSelector: 0x00
>>>                ..00 0000 = Rate: 0x00
>>>                00.. .... = PacketLifeTimeSelector: 0x00
>>>                ..00 0000 = PacketLifeTime: 0x00
>>>                Preference: 0x00
>>>    Variant CRC: 0xad4e
>>> ======================================================================================
>>
>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>> out that machine and the issue is internal to that machine. It could be
>> because of the underlying issue which hangs OpenSM when some IB program
>> tried to unregister from the MAD layer but there were outstanding work
>> completions. That's based on your original email earlier this AM.
> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.

Can ibdump be used to capture output on the SM port ?

-- Hal

>>
>>>>
>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>
>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
> Yes, all VL_CAP show 4 in the OpenSM log file.
> 
> Regards
> Jens
> 
> 
> 
>>
>> -- Hal
>>
>>> Regards
>>> Jens
>>>
>>>>
>>>> -- Hal
>>>>
>>>>>>
>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>
>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>
>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>
>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>> that's how SMSL is set by DFSSSP.
>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>
>>>>> Regards
>>>>> Jens
>>>>>
>>>>> --------------------------------
>>>>> Dipl.-Math. Jens Domke
>>>>> Researcher - Tokyo Institute of Technology
>>>>> Satoshi MATSUOKA Laboratory
>>>>> Global Scientific Information and Computing Center
>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>> Tokyo, 152-8550, JAPAN
>>>>> Tel/Fax: +81-3-5734-3876
>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>> --------------------------------
>>>>>
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --------------------------------
>>> Dipl.-Math. Jens Domke
>>> Researcher - Tokyo Institute of Technology
>>> Satoshi MATSUOKA Laboratory
>>> Global Scientific Information and Computing Center
>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>> Tokyo, 152-8550, JAPAN
>>> Tel/Fax: +81-3-5734-3876
>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>> --------------------------------
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]                     ` <50CB8F90.1030701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-16 12:03                       ` Jens Domke
  2012-12-16 12:32                         ` Hal Rosenstock
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-16 12:03 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:

> Hi,
> 
> On 12/14/2012 3:32 PM, Jens Domke wrote:
>> Hello Hal,
>> 
>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>> 
>>> Hi,
>>> 
>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>> Hello Hal,
>>>> 
>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>> 
>>>>> Hi again,
>>>>> 
>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>> 
>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>> 
>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>> 
>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>> 
>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>> 
>>>>>>>> As far as I understand the whole system:
>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>> 2. the SA receives the request on QP1
>>>>>>> 
>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>> set for that port.
>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>> 
>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>> 
>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>> than the one the query used and is the one returned inside the
>>>>>>> PathRecord attribute/data.
>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>> 
>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>> 
>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>> 
>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
> 
> So CompMask in the query has the SL bit on and SL is set to 0 inside the
> SubAdmGet of PatchRecord ?

No, the CompMask didn't have the SL bit set and the SL was set to 0.
I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c.
The SA just treats the SL in the PathRequest as an "I would like to use this SL" hint in case the SL bit is set.
But the routing engine can overwrite the requested SL before the reply is sent.

Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
Sadly, the reply sent by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA does the packet leave the node and reach the OMPI process.
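
Just for reference, that is nothing more than the following debug hack in osm_vendor_send() (libvendor/osm_vendor_ibumad.c), i.e. forcing the response onto SL 0 instead of the SL the request arrived on (clearly not a fix, just a way to get the reply out):

        /* GS classes */
        umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
                          p_mad_addr->addr_type.gsi.remote_qp,
                          0, /* debug hack: was p_mad_addr->addr_type.gsi.service_level */
                          IB_QP1_WELL_KNOWN_Q_KEY);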

> 
>> and sends the packet on SL_b (PortInfo.SMSL).
> 
> Good.
> 
>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
> 
> Depends. It may be that both SLs work but maybe not.
> 
>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>> What do you think?
> 
> Yes, it might be better to wildcard the SL in the query. The only
> scenario that would fail with the query you are making if there's no SL
> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
> If that's the case, SA should return MAD status 0xc (status code 3 -
> ERR_NO_RECORDS). But the response doesn't make it back to the requester
> OMPI node so it's not even getting that far.

Yes, exactly. So, do you have an idea why the response hangs in the SA node?
I have no insight into the underlying layers (kernel driver and firmware). Maybe there is something in those implementations which prevents the SA from sending MADs back on an SL>0?

> 
>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>> 
>>>>> 
>>>>>>> 
>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>> 
>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>> 
>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>> incoming request was received on.
>>>>> 
>>>>>>> 
>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>     /* GS classes */
>>>>>>>>     umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>                       p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>                       p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>                       IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>> 
>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>> from the SA if the SL used is not "reversible".
>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>> I get messages from the MPI process like the following:
>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>> 
>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>> at src/umad.c:791
>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>> (gdb) p *mad
>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>> 
>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>> 
>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>> Yes, it was SL 6.
>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>> ======================================================================================
>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>   785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>> 
>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>   Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>   Epoch Time: 1355389784.437633332 seconds
>>>>   [Time delta from previous captured frame: 4.332020528 seconds]
>>>>   [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>   [Time since reference or first frame: 14.352168681 seconds]
>>>>   Frame Number: 785
>>>>   Frame Length: 290 bytes (2320 bits)
>>>>   Capture Length: 290 bytes (2320 bits)
>>>>   [Frame is marked: False]
>>>>   [Frame is ignored: False]
>>>>   [Protocols in frame: erf:infiniband]
>>>> Extensible Record Format
>>>>   [ERF Header]
>>>>       Timestamp: 0x50c99b587008bcf2
>>>>       [Header type]
>>>>           .001 0101 = type: INFINIBAND (21)
>>>>           0... .... = Extension header present: 0
>>>>       0000 0100 = flags: 4
>>>>           .... ..00 = capture interface: 0
>>>>           .... .1.. = varying record length: 1
>>>>           .... 0... = truncated: 0
>>>>           ...0 .... = rx error: 0
>>>>           ..0. .... = ds error: 0
>>>>           00.. .... = reserved: 0
>>>>       record length: 306
>>>>       loss counter: 0
>>>>       wire length: 290
>>>> InfiniBand
>>>>   Local Route Header
>>>>       0110 .... = Virtual Lane: 0x06
>>>>       .... 0000 = Link Version: 0
>>>>       0110 .... = Service Level: 6
>>>>       .... 00.. = Reserved (2 bits): 0
>>>>       .... ..10 = Link Next Header: 0x02
>>>>       Destination Local ID: 19
>>>>       0000 0... .... .... = Reserved (5 bits): 0
>>>>       .... .000 0100 1000 = Packet Length: 72
>>>>       Source Local ID: 16
>>>>   Base Transport Header
>>>>       Opcode: 100
>>>>       1... .... = Solicited Event: True
>>>>       .1.. .... = MigReq: True
>>>>       ..00 .... = Pad Count: 0
>>>>       .... 0000 = Header Version: 0
>>>>       Partition Key: 65535
>>>>       Reserved (8 bits): 0
>>>>       Destination Queue Pair: 0x000001
>>>>       0... .... = Acknowledge Request: False
>>>>       .000 0000 = Reserved (7 bits): 0
>>>>       Packet Sequence Number: 0
>>>>   DETH - Datagram Extended Transport Header
>>>>       Queue Key: 2147549184
>>>>       Reserved (8 bits): 0
>>>>       Source Queue Pair: 0x00380050
>>>>   MAD Header - Common Management Datagram
>>>>       Base Version: 0x01
>>>>       Management Class: 0x03
>>>>       Class Version: 0x02
>>>>       Method: Get() (0x01)
>>>>       Status: 0x0000
>>>>       Class Specific: 0x0000
>>>>       Transaction ID: 0x0010000f38005000
>>>>       Attribute ID: 0x0035
>>>>       Reserved: 0x0000
>>>>       Attribute Modifier: 0x00000000
>>>>       MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>    Illegal RMPP Type (0)! 
>>>>       RMPP Type: 0x00
>>>>       RMPP Type: 0x00
>>>>       0000 .... = R Resp Time: 0x00
>>>>       .... 0000 = RMPP Flags: Unknown (0x00)
>>>>       RMPP Status:  (Normal) (0x00)
>>>>       RMPP Data 1: 0x00000000
>>>>       RMPP Data 2: 0x00000000
>>>>   SMASubnAdmGet(PathRecord)
>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>       Attribute Offset: 0x0000
>>>>       Reserved: 0x0000
>>>>       Component Mask: 0x0000003000000000
>>>>       Attribute (PathRecord)
>>>>           PathRecord
>>>>               DGID: :: (::)
>>>>               SGID: ::0.15.0.16 (::0.15.0.16)
>>>>               DLID: 0x0000
>>>>               SLID: 0x0000
>>>>               0... .... = RawTraffic: 0x00
>>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>               HopLimit: 0x00
>>>>               TClass: 0x00
>>>>               0... .... = Reversible: 0x00
>>>>               .000 0000 = NumbPath: 0x00
>>>>               P_Key: 0x0000
>>>>               .... .... .... 0000 = SL: 0x0000
>>>>               00.. .... = MTUSelector: 0x00
>>>>               ..00 0000 = MTU: 0x00
>>>>               00.. .... = RateSelector: 0x00
>>>>               ..00 0000 = Rate: 0x00
>>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>>               ..00 0000 = PacketLifeTime: 0x00
>>>>               Preference: 0x00
>>>>   Variant CRC: 0xad4e
>>>> ======================================================================================
>>> 
>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>> out that machine and the issue is internal to that machine. It could be
>>> because of the underlying issue which hangs OpenSM when some IB program
>>> tried to unregister from the MAD layer but there were outstanding work
>>> completions. That's based on your original email earlier this AM.
>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
> 
> Can ibdump be used to capture output on the SM port ?

Yes, that works quite well, despite the warning in the ibdump manual.
But I started ibdump before opensm; maybe that makes a difference, not sure.

Regards,
Jens

PS: I have seen a small bug. Not sure if it's a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly by wireshark. But it causes wireshark to show the PathRecord data with wrong values.
Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.

> 
> -- Hal
> 
>>> 
>>>>> 
>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>> 
>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>> Yes, all VL_CAP show 4 in the OpenSM log file.
>> 
>> Regards
>> Jens
>> 
>> 
>> 
>>> 
>>> -- Hal
>>> 
>>>> Regards
>>>> Jens
>>>> 
>>>>> 
>>>>> -- Hal
>>>>> 
>>>>>>> 
>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>> 
>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>> 
>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>> 
>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>> 
>>>>>> Regards
>>>>>> Jens
>>>>>> 
>>>>>> --------------------------------
>>>>>> Dipl.-Math. Jens Domke
>>>>>> Researcher - Tokyo Institute of Technology
>>>>>> Satoshi MATSUOKA Laboratory
>>>>>> Global Scientific Information and Computing Center
>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>> Tokyo, 152-8550, JAPAN
>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>> --------------------------------
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --------------------------------
>>>> Dipl.-Math. Jens Domke
>>>> Researcher - Tokyo Institute of Technology
>>>> Satoshi MATSUOKA Laboratory
>>>> Global Scientific Information and Computing Center
>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>> Tokyo, 152-8550, JAPAN
>>>> Tel/Fax: +81-3-5734-3876
>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>> --------------------------------
>>>> 
>>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
  2012-12-16 12:03                       ` Jens Domke
@ 2012-12-16 12:32                         ` Hal Rosenstock
       [not found]                           ` <50CDBF61.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-16 12:32 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi,

On 12/16/2012 7:03 AM, Jens Domke wrote:
> Hello Hal,
> 
> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
> 
>> Hi,
>>
>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>> Hello Hal,
>>>
>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>
>>>> Hi,
>>>>
>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>> Hello Hal,
>>>>>
>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>
>>>>>> Hi again,
>>>>>>
>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>>
>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>
>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>
>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>
>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>
>>>>>>>>> As far as I understand the whole system:
>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>
>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>> set for that port.
>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>
>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>
>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>> PathRecord attribute/data.
>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>
>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>
>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>
>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>
>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>> SubAdmGet of PatchRecord ?
> 
> No, the CompMask didn't had the SL bit and the SL was set to 0.

That means the SL in the request is wildcarded so the SA/SM fills in a
valid one in the response.
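
(For reference, that is the usual ComponentMask semantic: a clear bit marks the
field as a wildcard the SA is free to fill in, a set bit means "match exactly
this value". A minimal sketch of the two query variants -- struct pr_query and
PR_COMPMASK_SL are illustrative stand-ins, not the real ib_path_rec_t /
IB_PR_COMPMASK_SL definitions from ib_types.h:)

#include <stdint.h>

#define PR_COMPMASK_SL (UINT64_C(1) << 15)   /* stand-in, value illustrative */

struct pr_query {
        uint64_t comp_mask;   /* SA ComponentMask (host order for simplicity) */
        uint8_t  sl;          /* low 4 bits of the PathRecord SL field */
};

/* SL wildcarded: the SA/SM picks a valid SL and returns it in the reply. */
static void pr_wildcard_sl(struct pr_query *q)
{
        q->comp_mask &= ~PR_COMPMASK_SL;
        q->sl = 0;            /* ignored while the bit is clear */
}

/* SL pinned (e.g. to PortInfo.SMSL): only records with exactly that SL match. */
static void pr_pin_sl(struct pr_query *q, uint8_t smsl)
{
        q->comp_mask |= PR_COMPMASK_SL;
        q->sl = (uint8_t)(smsl & 0x0f);
}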

> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
> But the routing engine can overwrite the requested SL before the reply is send.
>
> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.

Are you sure the response doesn't leave the SA node, or just that it's not
received at the requester (OMPI node) ?
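
(For reference, the reply path in question boils down to roughly the sketch
below. umad_set_addr_net()/umad_send() are the real libibumad calls from the
gdb trace; send_sa_reply() and its force_sl_zero knob are purely illustrative,
mimicking the manual "set SL to 0 right before umad_send" workaround:)

#include <infiniband/umad.h>

/* resp is a umad buffer that already holds the GetResp MAD; dlid, rqp and
 * qkey are the network-order values taken from the incoming request, sl is
 * the SL the request arrived on. */
static int send_sa_reply(int fd, int agent_id, void *resp, int mad_len,
                         int dlid, int rqp, int sl, int qkey,
                         int force_sl_zero)
{
        umad_set_addr_net(resp, dlid, rqp, force_sl_zero ? 0 : sl, qkey);

        /* same call as in the trace:
         * umad_send(fd, agentid, umad, length, timeout_ms, retries) */
        return umad_send(fd, agent_id, resp, mad_len, 0, 3);
}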

> 
>>
>>> and sends the packet on SL_b (PortInfo.SMSL).
>>
>> Good.
>>
>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>
>> Depends. It may be that both SLs work but maybe not.
>>
>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>> What do you think?
>>
>> Yes, it might be better to wildcard the SL in the query. The only
>> scenario that would fail with the query you are making if there's no SL
>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>> If that's the case, SA should return MAD status 0xc (status code 3 -
>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>> OMPI node so it's not even getting that far.
> 
> Yes, exactly. So, do you have an idea why the response hands in the SA node?
> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?

If you're sure this response doesn't get out of the SA node, please
contact Mellanox support with the details.

>>
>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>
>>>>>>
>>>>>>>>
>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>
>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>
>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>> incoming request was received on.
>>>>>>
>>>>>>>>
>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>     /* GS classes */
>>>>>>>>>     umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>                       p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>                       p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>                       IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>
>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>> I get messages from the MPI process like the following:
>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>
>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>> at src/umad.c:791
>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>> (gdb) p *mad
>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>
>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>
>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>> Yes, it was SL 6.
>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>> ======================================================================================
>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>   785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>
>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>   Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>   Epoch Time: 1355389784.437633332 seconds
>>>>>   [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>   [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>   [Time since reference or first frame: 14.352168681 seconds]
>>>>>   Frame Number: 785
>>>>>   Frame Length: 290 bytes (2320 bits)
>>>>>   Capture Length: 290 bytes (2320 bits)
>>>>>   [Frame is marked: False]
>>>>>   [Frame is ignored: False]
>>>>>   [Protocols in frame: erf:infiniband]
>>>>> Extensible Record Format
>>>>>   [ERF Header]
>>>>>       Timestamp: 0x50c99b587008bcf2
>>>>>       [Header type]
>>>>>           .001 0101 = type: INFINIBAND (21)
>>>>>           0... .... = Extension header present: 0
>>>>>       0000 0100 = flags: 4
>>>>>           .... ..00 = capture interface: 0
>>>>>           .... .1.. = varying record length: 1
>>>>>           .... 0... = truncated: 0
>>>>>           ...0 .... = rx error: 0
>>>>>           ..0. .... = ds error: 0
>>>>>           00.. .... = reserved: 0
>>>>>       record length: 306
>>>>>       loss counter: 0
>>>>>       wire length: 290
>>>>> InfiniBand
>>>>>   Local Route Header
>>>>>       0110 .... = Virtual Lane: 0x06
>>>>>       .... 0000 = Link Version: 0
>>>>>       0110 .... = Service Level: 6
>>>>>       .... 00.. = Reserved (2 bits): 0
>>>>>       .... ..10 = Link Next Header: 0x02
>>>>>       Destination Local ID: 19
>>>>>       0000 0... .... .... = Reserved (5 bits): 0
>>>>>       .... .000 0100 1000 = Packet Length: 72
>>>>>       Source Local ID: 16
>>>>>   Base Transport Header
>>>>>       Opcode: 100
>>>>>       1... .... = Solicited Event: True
>>>>>       .1.. .... = MigReq: True
>>>>>       ..00 .... = Pad Count: 0
>>>>>       .... 0000 = Header Version: 0
>>>>>       Partition Key: 65535
>>>>>       Reserved (8 bits): 0
>>>>>       Destination Queue Pair: 0x000001
>>>>>       0... .... = Acknowledge Request: False
>>>>>       .000 0000 = Reserved (7 bits): 0
>>>>>       Packet Sequence Number: 0
>>>>>   DETH - Datagram Extended Transport Header
>>>>>       Queue Key: 2147549184
>>>>>       Reserved (8 bits): 0
>>>>>       Source Queue Pair: 0x00380050
>>>>>   MAD Header - Common Management Datagram
>>>>>       Base Version: 0x01
>>>>>       Management Class: 0x03
>>>>>       Class Version: 0x02
>>>>>       Method: Get() (0x01)
>>>>>       Status: 0x0000
>>>>>       Class Specific: 0x0000
>>>>>       Transaction ID: 0x0010000f38005000
>>>>>       Attribute ID: 0x0035
>>>>>       Reserved: 0x0000
>>>>>       Attribute Modifier: 0x00000000
>>>>>       MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>    Illegal RMPP Type (0)! 
>>>>>       RMPP Type: 0x00
>>>>>       RMPP Type: 0x00
>>>>>       0000 .... = R Resp Time: 0x00
>>>>>       .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>       RMPP Status:  (Normal) (0x00)
>>>>>       RMPP Data 1: 0x00000000
>>>>>       RMPP Data 2: 0x00000000
>>>>>   SMASubnAdmGet(PathRecord)
>>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>>       Attribute Offset: 0x0000
>>>>>       Reserved: 0x0000
>>>>>       Component Mask: 0x0000003000000000
>>>>>       Attribute (PathRecord)
>>>>>           PathRecord
>>>>>               DGID: :: (::)
>>>>>               SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>               DLID: 0x0000
>>>>>               SLID: 0x0000
>>>>>               0... .... = RawTraffic: 0x00
>>>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>               HopLimit: 0x00
>>>>>               TClass: 0x00
>>>>>               0... .... = Reversible: 0x00
>>>>>               .000 0000 = NumbPath: 0x00
>>>>>               P_Key: 0x0000
>>>>>               .... .... .... 0000 = SL: 0x0000
>>>>>               00.. .... = MTUSelector: 0x00
>>>>>               ..00 0000 = MTU: 0x00
>>>>>               00.. .... = RateSelector: 0x00
>>>>>               ..00 0000 = Rate: 0x00
>>>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>>>               ..00 0000 = PacketLifeTime: 0x00
>>>>>               Preference: 0x00
>>>>>   Variant CRC: 0xad4e
>>>>> ======================================================================================
>>>>
>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>> out that machine and the issue is internal to that machine. It could be
>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>> tried to unregister from the MAD layer but there were outstanding work
>>>> completions. That's based on your original email earlier this AM.
>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>
>> Can ibdump be used to capture output on the SM port ?
> 
> Yes, that works quite well, despite the warning in the ibdump manual.
> But I have started ibdump before opensm, maybe that makes a difference, not sure.
> 
> Regards,
> Jens
> 
> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.

Are you referring to the fields after the SA AttributeOffset or
something else ?

-- Hal

>>
>> -- Hal
>>
>>>>
>>>>>>
>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>
>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>
>>> Regards
>>> Jens
>>>
>>>
>>>
>>>>
>>>> -- Hal
>>>>
>>>>> Regards
>>>>> Jens
>>>>>
>>>>>>
>>>>>> -- Hal
>>>>>>
>>>>>>>>
>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>
>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>
>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>
>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>
>>>>>>> Regards
>>>>>>> Jens
>>>>>>>
>>>>>>> --------------------------------
>>>>>>> Dipl.-Math. Jens Domke
>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>> Global Scientific Information and Computing Center
>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>> --------------------------------
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> --------------------------------
>>>>> Dipl.-Math. Jens Domke
>>>>> Researcher - Tokyo Institute of Technology
>>>>> Satoshi MATSUOKA Laboratory
>>>>> Global Scientific Information and Computing Center
>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>> Tokyo, 152-8550, JAPAN
>>>>> Tel/Fax: +81-3-5734-3876
>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>> --------------------------------
>>>>>
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --------------------------------
>>> Dipl.-Math. Jens Domke
>>> Researcher - Tokyo Institute of Technology
>>> Satoshi MATSUOKA Laboratory
>>> Global Scientific Information and Computing Center
>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>> Tokyo, 152-8550, JAPAN
>>> Tel/Fax: +81-3-5734-3876
>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>> --------------------------------
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]                           ` <50CDBF61.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-16 13:25                             ` Hal Rosenstock
  2012-12-16 13:39                             ` Jens Domke
  1 sibling, 0 replies; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-16 13:25 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

On 12/16/2012 7:32 AM, Hal Rosenstock wrote:
> Hi,
> 
> On 12/16/2012 7:03 AM, Jens Domke wrote:
>> Hello Hal,
>>
>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>
>>> Hi,
>>>
>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>> Hello Hal,
>>>>
>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>>
>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>
>>>>>>> Hi again,
>>>>>>>
>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>> Hello Hal,
>>>>>>>>
>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>
>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>
>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>>
>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>
>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>
>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>> set for that port.
>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>>
>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>
>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>> PathRecord attribute/data.
>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>
>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>
>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>
>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>>
>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>> SubAdmGet of PatchRecord ?
>>
>> No, the CompMask didn't had the SL bit and the SL was set to 0.
> 
> That means the SL in the request is wildcarded so the SA/SM fills in a
> valid one in the response.
> 
>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>> But the routing engine can overwrite the requested SL before the reply is send.
>>
>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
> 
> Are you sure the response doesn't leave the SA node, or just that it's not
> received at the requester (OMPI node) ?
> 
>>
>>>
>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>
>>> Good.
>>>
>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>
>>> Depends. It may be that both SLs work but maybe not.
>>>
>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>> What do you think?
>>>
>>> Yes, it might be better to wildcard the SL in the query. The only
>>> scenario that would fail with the query you are making if there's no SL
>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>> OMPI node so it's not even getting that far.
>>
>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
> 
> If you're sure this response doesn't get out of the SA node, please
> contact Mellanox support with the details.

A couple of experiments just to be sure:

1. On the OpenSM node: smpquery sl2vl and smpquery pi for the local SM port

2. On an OMPI node: saquery -P --src-to-dst <src:dst>
   (gets a PathRecord for <src:dst>, where src and dst are either node
   names or LIDs; example invocations below)

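For instance, with the LIDs seen in the capture earlier in the thread
(requester LID 16, SA/SM port LID 19 -- purely illustrative, substitute your
own), that would be, on the OpenSM node:

        smpquery sl2vl 19
        smpquery pi 19

and on an OMPI node:

        saquery -P --src-to-dst 16:19
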
Thanks.

-- Hal

>>>
>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>
>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>
>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>> incoming request was received on.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>     /* GS classes */
>>>>>>>>>>     umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>                       p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>                       p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>                       IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>
>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>
>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>> at src/umad.c:791
>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>> (gdb) p *mad
>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>
>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>>
>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>> Yes, it was SL 6.
>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>> ======================================================================================
>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>   785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>
>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>   Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>   Epoch Time: 1355389784.437633332 seconds
>>>>>>   [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>   [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>   [Time since reference or first frame: 14.352168681 seconds]
>>>>>>   Frame Number: 785
>>>>>>   Frame Length: 290 bytes (2320 bits)
>>>>>>   Capture Length: 290 bytes (2320 bits)
>>>>>>   [Frame is marked: False]
>>>>>>   [Frame is ignored: False]
>>>>>>   [Protocols in frame: erf:infiniband]
>>>>>> Extensible Record Format
>>>>>>   [ERF Header]
>>>>>>       Timestamp: 0x50c99b587008bcf2
>>>>>>       [Header type]
>>>>>>           .001 0101 = type: INFINIBAND (21)
>>>>>>           0... .... = Extension header present: 0
>>>>>>       0000 0100 = flags: 4
>>>>>>           .... ..00 = capture interface: 0
>>>>>>           .... .1.. = varying record length: 1
>>>>>>           .... 0... = truncated: 0
>>>>>>           ...0 .... = rx error: 0
>>>>>>           ..0. .... = ds error: 0
>>>>>>           00.. .... = reserved: 0
>>>>>>       record length: 306
>>>>>>       loss counter: 0
>>>>>>       wire length: 290
>>>>>> InfiniBand
>>>>>>   Local Route Header
>>>>>>       0110 .... = Virtual Lane: 0x06
>>>>>>       .... 0000 = Link Version: 0
>>>>>>       0110 .... = Service Level: 6
>>>>>>       .... 00.. = Reserved (2 bits): 0
>>>>>>       .... ..10 = Link Next Header: 0x02
>>>>>>       Destination Local ID: 19
>>>>>>       0000 0... .... .... = Reserved (5 bits): 0
>>>>>>       .... .000 0100 1000 = Packet Length: 72
>>>>>>       Source Local ID: 16
>>>>>>   Base Transport Header
>>>>>>       Opcode: 100
>>>>>>       1... .... = Solicited Event: True
>>>>>>       .1.. .... = MigReq: True
>>>>>>       ..00 .... = Pad Count: 0
>>>>>>       .... 0000 = Header Version: 0
>>>>>>       Partition Key: 65535
>>>>>>       Reserved (8 bits): 0
>>>>>>       Destination Queue Pair: 0x000001
>>>>>>       0... .... = Acknowledge Request: False
>>>>>>       .000 0000 = Reserved (7 bits): 0
>>>>>>       Packet Sequence Number: 0
>>>>>>   DETH - Datagram Extended Transport Header
>>>>>>       Queue Key: 2147549184
>>>>>>       Reserved (8 bits): 0
>>>>>>       Source Queue Pair: 0x00380050
>>>>>>   MAD Header - Common Management Datagram
>>>>>>       Base Version: 0x01
>>>>>>       Management Class: 0x03
>>>>>>       Class Version: 0x02
>>>>>>       Method: Get() (0x01)
>>>>>>       Status: 0x0000
>>>>>>       Class Specific: 0x0000
>>>>>>       Transaction ID: 0x0010000f38005000
>>>>>>       Attribute ID: 0x0035
>>>>>>       Reserved: 0x0000
>>>>>>       Attribute Modifier: 0x00000000
>>>>>>       MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>    Illegal RMPP Type (0)! 
>>>>>>       RMPP Type: 0x00
>>>>>>       RMPP Type: 0x00
>>>>>>       0000 .... = R Resp Time: 0x00
>>>>>>       .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>       RMPP Status:  (Normal) (0x00)
>>>>>>       RMPP Data 1: 0x00000000
>>>>>>       RMPP Data 2: 0x00000000
>>>>>>   SMASubnAdmGet(PathRecord)
>>>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>>>       Attribute Offset: 0x0000
>>>>>>       Reserved: 0x0000
>>>>>>       Component Mask: 0x0000003000000000
>>>>>>       Attribute (PathRecord)
>>>>>>           PathRecord
>>>>>>               DGID: :: (::)
>>>>>>               SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>               DLID: 0x0000
>>>>>>               SLID: 0x0000
>>>>>>               0... .... = RawTraffic: 0x00
>>>>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>               HopLimit: 0x00
>>>>>>               TClass: 0x00
>>>>>>               0... .... = Reversible: 0x00
>>>>>>               .000 0000 = NumbPath: 0x00
>>>>>>               P_Key: 0x0000
>>>>>>               .... .... .... 0000 = SL: 0x0000
>>>>>>               00.. .... = MTUSelector: 0x00
>>>>>>               ..00 0000 = MTU: 0x00
>>>>>>               00.. .... = RateSelector: 0x00
>>>>>>               ..00 0000 = Rate: 0x00
>>>>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>               ..00 0000 = PacketLifeTime: 0x00
>>>>>>               Preference: 0x00
>>>>>>   Variant CRC: 0xad4e
>>>>>> ======================================================================================
>>>>>
>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>> out that machine and the issue is internal to that machine. It could be
>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>> completions. That's based on your original email earlier this AM.
>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>
>>> Can ibdump be used to capture output on the SM port ?
>>
>> Yes, that works quite well, despite the warning in the ibdump manual.
>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>
>> Regards,
>> Jens
>>
>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
> 
> Are you referring to the fields after the SA AttributeOffset or
> something else ?
> 
> -- Hal
> 
>>>
>>> -- Hal
>>>
>>>>>
>>>>>>>
>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>
>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>
>>>> Regards
>>>> Jens
>>>>
>>>>
>>>>
>>>>>
>>>>> -- Hal
>>>>>
>>>>>> Regards
>>>>>> Jens
>>>>>>
>>>>>>>
>>>>>>> -- Hal
>>>>>>>
>>>>>>>>>
>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>
>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>
>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>
>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Jens
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --------------------------------
>>>>>> Dipl.-Math. Jens Domke
>>>>>> Researcher - Tokyo Institute of Technology
>>>>>> Satoshi MATSUOKA Laboratory
>>>>>> Global Scientific Information and Computing Center
>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>> Tokyo, 152-8550, JAPAN
>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>> --------------------------------
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>> --------------------------------
>>>> Dipl.-Math. Jens Domke
>>>> Researcher - Tokyo Institute of Technology
>>>> Satoshi MATSUOKA Laboratory
>>>> Global Scientific Information and Computing Center
>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>> Tokyo, 152-8550, JAPAN
>>>> Tel/Fax: +81-3-5734-3876
>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>> --------------------------------
>>>>
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>>
>>
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]                           ` <50CDBF61.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2012-12-16 13:25                             ` Hal Rosenstock
@ 2012-12-16 13:39                             ` Jens Domke
  2012-12-16 13:48                               ` Hal Rosenstock
  1 sibling, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-16 13:39 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi,

On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:

> Hi,
> 
> On 12/16/2012 7:03 AM, Jens Domke wrote:
>> Hello Hal,
>> 
>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>> 
>>> Hi,
>>> 
>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>> Hello Hal,
>>>> 
>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>> 
>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>> 
>>>>>>> Hi again,
>>>>>>> 
>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>> Hello Hal,
>>>>>>>> 
>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>> 
>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>> 
>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>> 
>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>> 
>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>> 
>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>> set for that port.
>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>> 
>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>> 
>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>> PathRecord attribute/data.
>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>> 
>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>> 
>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>> 
>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>> 
>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>> SubAdmGet of PatchRecord ?
>> 
>> No, the CompMask didn't had the SL bit and the SL was set to 0.
> 
> That means the SL in the request is wildcarded so the SA/SM fills in a
> valid one in the response.
Ok.
> 
>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>> But the routing engine can overwrite the requested SL before the reply is send.
>> 
>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
> 
> Are you sure the response doesn't leave the SA node, or just that it's not
> received at the requester (OMPI node) ?
No, I'm not sure. Is there any way to check that? As far as I know, ibdump does not show MAD packets that leave a port; it only shows the packets when they are received on the other end.
> 
>> 
>>> 
>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>> 
>>> Good.
>>> 
>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>> 
>>> Depends. It may be that both SLs work but maybe not.
>>> 
>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>> What do you think?
>>> 
>>> Yes, it might be better to wildcard the SL in the query. The only
>>> scenario that would fail with the query you are making if there's no SL
>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>> OMPI node so it's not even getting that far.
>> 
>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
> 
> If you're sure this response doesn't get out of the SA node, please
> contact Mellanox support with the details.
Ok, I can do this if it turns out to be true.
> 
>>> 
>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>> 
>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>> 
>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>> incoming request was received on.
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>    /* GS classes */
>>>>>>>>>>    umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>                      p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>                      p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>                      IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>> 
>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>> 
>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>> at src/umad.c:791
>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>> (gdb) p *mad
>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>> 
>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>> 
>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>> Yes, it was SL 6.
>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>> ======================================================================================
>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>  785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>> 
>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>  Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>  Epoch Time: 1355389784.437633332 seconds
>>>>>>  [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>  [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>  [Time since reference or first frame: 14.352168681 seconds]
>>>>>>  Frame Number: 785
>>>>>>  Frame Length: 290 bytes (2320 bits)
>>>>>>  Capture Length: 290 bytes (2320 bits)
>>>>>>  [Frame is marked: False]
>>>>>>  [Frame is ignored: False]
>>>>>>  [Protocols in frame: erf:infiniband]
>>>>>> Extensible Record Format
>>>>>>  [ERF Header]
>>>>>>      Timestamp: 0x50c99b587008bcf2
>>>>>>      [Header type]
>>>>>>          .001 0101 = type: INFINIBAND (21)
>>>>>>          0... .... = Extension header present: 0
>>>>>>      0000 0100 = flags: 4
>>>>>>          .... ..00 = capture interface: 0
>>>>>>          .... .1.. = varying record length: 1
>>>>>>          .... 0... = truncated: 0
>>>>>>          ...0 .... = rx error: 0
>>>>>>          ..0. .... = ds error: 0
>>>>>>          00.. .... = reserved: 0
>>>>>>      record length: 306
>>>>>>      loss counter: 0
>>>>>>      wire length: 290
>>>>>> InfiniBand
>>>>>>  Local Route Header
>>>>>>      0110 .... = Virtual Lane: 0x06
>>>>>>      .... 0000 = Link Version: 0
>>>>>>      0110 .... = Service Level: 6
>>>>>>      .... 00.. = Reserved (2 bits): 0
>>>>>>      .... ..10 = Link Next Header: 0x02
>>>>>>      Destination Local ID: 19
>>>>>>      0000 0... .... .... = Reserved (5 bits): 0
>>>>>>      .... .000 0100 1000 = Packet Length: 72
>>>>>>      Source Local ID: 16
>>>>>>  Base Transport Header
>>>>>>      Opcode: 100
>>>>>>      1... .... = Solicited Event: True
>>>>>>      .1.. .... = MigReq: True
>>>>>>      ..00 .... = Pad Count: 0
>>>>>>      .... 0000 = Header Version: 0
>>>>>>      Partition Key: 65535
>>>>>>      Reserved (8 bits): 0
>>>>>>      Destination Queue Pair: 0x000001
>>>>>>      0... .... = Acknowledge Request: False
>>>>>>      .000 0000 = Reserved (7 bits): 0
>>>>>>      Packet Sequence Number: 0
>>>>>>  DETH - Datagram Extended Transport Header
>>>>>>      Queue Key: 2147549184
>>>>>>      Reserved (8 bits): 0
>>>>>>      Source Queue Pair: 0x00380050
>>>>>>  MAD Header - Common Management Datagram
>>>>>>      Base Version: 0x01
>>>>>>      Management Class: 0x03
>>>>>>      Class Version: 0x02
>>>>>>      Method: Get() (0x01)
>>>>>>      Status: 0x0000
>>>>>>      Class Specific: 0x0000
>>>>>>      Transaction ID: 0x0010000f38005000
>>>>>>      Attribute ID: 0x0035
>>>>>>      Reserved: 0x0000
>>>>>>      Attribute Modifier: 0x00000000
>>>>>>      MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>   Illegal RMPP Type (0)! 
>>>>>>      RMPP Type: 0x00
>>>>>>      RMPP Type: 0x00
>>>>>>      0000 .... = R Resp Time: 0x00
>>>>>>      .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>      RMPP Status:  (Normal) (0x00)
>>>>>>      RMPP Data 1: 0x00000000
>>>>>>      RMPP Data 2: 0x00000000
>>>>>>  SMASubnAdmGet(PathRecord)
>>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>>      Attribute Offset: 0x0000
>>>>>>      Reserved: 0x0000
>>>>>>      Component Mask: 0x0000003000000000
>>>>>>      Attribute (PathRecord)
>>>>>>          PathRecord
>>>>>>              DGID: :: (::)
>>>>>>              SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>              DLID: 0x0000
>>>>>>              SLID: 0x0000
>>>>>>              0... .... = RawTraffic: 0x00
>>>>>>              .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>              HopLimit: 0x00
>>>>>>              TClass: 0x00
>>>>>>              0... .... = Reversible: 0x00
>>>>>>              .000 0000 = NumbPath: 0x00
>>>>>>              P_Key: 0x0000
>>>>>>              .... .... .... 0000 = SL: 0x0000
>>>>>>              00.. .... = MTUSelector: 0x00
>>>>>>              ..00 0000 = MTU: 0x00
>>>>>>              00.. .... = RateSelector: 0x00
>>>>>>              ..00 0000 = Rate: 0x00
>>>>>>              00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>              ..00 0000 = PacketLifeTime: 0x00
>>>>>>              Preference: 0x00
>>>>>>  Variant CRC: 0xad4e
>>>>>> ======================================================================================
>>>>> 
>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>> out that machine and the issue is internal to that machine. It could be
>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>> completions. That's based on your original email earlier this AM.
>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>> 
>>> Can ibdump be used to capture output on the SM port ?
>> 
>> Yes, that works quite well, despite the warning in the ibdump manual.
>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>> 
>> Regards,
>> Jens
>> 
>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
> 
> Are you referring to the fields after the SA AttributeOffset or
> something else ?
Yes, the fields after the SMASubnAdmGet Attribute Offset. Here is an example of what I get on the OMPI side:
    SMASubnAdmGetResp(PathRecord)
        SM_Key (Verification Key): 0x0000000000000000
        Attribute Offset: 0x0008
        Reserved: 0x0000
        Component Mask: 0x0000803000000000
        Attribute (PathRecord)
            PathRecord
                DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
                SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
                DLID: 0x0000
                SLID: 0x0000
                0... .... = RawTraffic: 0x00
                .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
                HopLimit: 0xff
                TClass: 0x00
                0... .... = Reversible: 0x00
                .000 0011 = NumbPath: 0x03
                P_Key: 0x8486
                .... .... .... 0000 = SL: 0x0000
                00.. .... = MTUSelector: 0x00
                ..00 0000 = MTU: 0x00
                00.. .... = RateSelector: 0x00
                ..00 0000 = Rate: 0x00
                00.. .... = PacketLifeTimeSelector: 0x00
                ..00 0000 = PacketLifeTime: 0x00
                Preference: 0x00

But it should show the following (note the differences in SLID, DLID, and SL, which are now correct):
    SMASubnAdmGetResp(PathRecord)
        SM_Key (Verification Key): 0x0000000000000000
        Attribute Offset: 0x0008
        Reserved: 0x0000
        Component Mask: 0x0000803000000000
        Attribute (PathRecord)
            PathRecord
                DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
                SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
                DLID: 0x0004
                SLID: 0x0008
                0... .... = RawTraffic: 0x00
                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
                HopLimit: 0x00
                TClass: 0x00
                1... .... = Reversible: 0x01
                .000 0000 = NumbPath: 0x00
                P_Key: 0xffff
                .... .... .... 0011 = SL: 0x0003
                10.. .... = MTUSelector: 0x02
                ..00 0100 = MTU: 0x04
                10.. .... = RateSelector: 0x02
                ..00 0110 = Rate: 0x06
                10.. .... = PacketLifeTimeSelector: 0x02
                ..01 0010 = PacketLifeTime: 0x12
                Preference: 0x00
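
To illustrate what I think is happening, here is a rough sketch (plain C, not the actual wireshark/ibdump dissector code; the record layout and offsets are simplified assumptions of mine) of how starting the parse of the attribute a couple of bytes off shifts every later fixed-width field, so the GIDs still look almost plausible while DLID, SLID and SL come out wrong:

/*
 * Hypothetical illustration only -- not the actual wireshark/ibdump code.
 * Simplified PathRecord-like layout: DGID at +0, SGID at +16, DLID at +32,
 * SLID at +34 (all big endian).  Starting the parse two bytes early shifts
 * every one of these fields.
 */
#include <stdio.h>
#include <stdint.h>
#include <string.h>

static void parse(const uint8_t *buf, size_t start)
{
	uint16_t dlid = (buf[start + 32] << 8) | buf[start + 33];
	uint16_t slid = (buf[start + 34] << 8) | buf[start + 35];

	printf("start=%zu: DGID begins %02x%02x..., DLID=0x%04x, SLID=0x%04x\n",
	       start, buf[start], buf[start + 1], dlid, slid);
}

int main(void)
{
	/* fake DGID with the fe80:: subnet prefix, reused as a stand-in SGID */
	static const uint8_t gid[16] = { 0xfe, 0x80, 0, 0, 0, 0, 0, 0,
					 0x00, 0x08, 0xf1, 0x04, 0x03, 0x99, 0xeb, 0xb5 };
	uint8_t pkt[2 + 64] = { 0x00, 0x08 };	/* 2 header bytes, then the record */

	memcpy(pkt + 2, gid, 16);		/* DGID */
	memcpy(pkt + 2 + 16, gid, 16);		/* SGID */
	pkt[2 + 32] = 0x00; pkt[2 + 33] = 0x04;	/* DLID = 0x0004 */
	pkt[2 + 34] = 0x00; pkt[2 + 35] = 0x08;	/* SLID = 0x0008 */

	parse(pkt, 2);	/* correct start: fe80... GIDs, DLID=0x0004, SLID=0x0008 */
	parse(pkt, 0);	/* 2 bytes early: every field is read from the wrong place */
	return 0;
}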


Regards,
Jens

> 
> -- Hal
> 
>>> 
>>> -- Hal
>>> 
>>>>> 
>>>>>>> 
>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>> 
>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>> 
>>>> Regards
>>>> Jens
>>>> 
>>>> 
>>>> 
>>>>> 
>>>>> -- Hal
>>>>> 
>>>>>> Regards
>>>>>> Jens
>>>>>> 
>>>>>>> 
>>>>>>> -- Hal
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>> 
>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>> 
>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>> 
>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>> 
>>>>>>>> Regards
>>>>>>>> Jens
>>>>>>>> 
>>>>>>>> --------------------------------
>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>> --------------------------------
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> --------------------------------
>>>>>> Dipl.-Math. Jens Domke
>>>>>> Researcher - Tokyo Institute of Technology
>>>>>> Satoshi MATSUOKA Laboratory
>>>>>> Global Scientific Information and Computing Center
>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>> Tokyo, 152-8550, JAPAN
>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>> --------------------------------
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --------------------------------
>>>> Dipl.-Math. Jens Domke
>>>> Researcher - Tokyo Institute of Technology
>>>> Satoshi MATSUOKA Laboratory
>>>> Global Scientific Information and Computing Center
>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>> Tokyo, 152-8550, JAPAN
>>>> Tel/Fax: +81-3-5734-3876
>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>> --------------------------------
>>>> 
>>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
  2012-12-16 13:39                             ` Jens Domke
@ 2012-12-16 13:48                               ` Hal Rosenstock
       [not found]                                 ` <50CDD114.2090706-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-16 13:48 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

On 12/16/2012 8:39 AM, Jens Domke wrote:
> Hi,
> 
> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
> 
>> Hi,
>>
>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>> Hello Hal,
>>>
>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>
>>>> Hi,
>>>>
>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>> Hello Hal,
>>>>>
>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>>
>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>
>>>>>>>> Hi again,
>>>>>>>>
>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>>
>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>
>>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>
>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>>>
>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>
>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>
>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>> set for that port.
>>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>>>
>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>
>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>>
>>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>
>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>>
>>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>>>
>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>> SubAdmGet of PatchRecord ?
>>>
>>> No, the CompMask didn't had the SL bit and the SL was set to 0.
>>
>> That means the SL in the request is wildcarded so the SA/SM fills in a
>> valid one in the response.
> Ok.
>>
>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>>> But the routing engine can overwrite the requested SL before the reply is send.
>>>
>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
>>
>> Are you sure the response doesn't leave the SA node or it's not received
>> at the requester (OMPI node) ?
> No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end.
>>
>>>
>>>>
>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>
>>>> Good.
>>>>
>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>
>>>> Depends. It may be that both SLs work but maybe not.
>>>>
>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>>> What do you think?
>>>>
>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>> scenario that would fail with the query you are making if there's no SL
>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>> OMPI node so it's not even getting that far.
>>>
>>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
>>
>> If you're sure this response doesn't get out of the SA node, please
>> contact Mellanox support with the details.
> Ok, I can do this, if it turns out to be true.
>>
>>>>
>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>>>
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>
>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>>
>>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>>> incoming request was received on.
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>>    /* GS classes */
>>>>>>>>>>>    umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>                      p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>                      p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>                      IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>
>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>>
>>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>>
>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>>>
>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>> Yes, it was SL 6.
>>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>>> ======================================================================================
>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>  785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>
>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>>  Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>  Epoch Time: 1355389784.437633332 seconds
>>>>>>>  [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>  [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>  [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>  Frame Number: 785
>>>>>>>  Frame Length: 290 bytes (2320 bits)
>>>>>>>  Capture Length: 290 bytes (2320 bits)
>>>>>>>  [Frame is marked: False]
>>>>>>>  [Frame is ignored: False]
>>>>>>>  [Protocols in frame: erf:infiniband]
>>>>>>> Extensible Record Format
>>>>>>>  [ERF Header]
>>>>>>>      Timestamp: 0x50c99b587008bcf2
>>>>>>>      [Header type]
>>>>>>>          .001 0101 = type: INFINIBAND (21)
>>>>>>>          0... .... = Extension header present: 0
>>>>>>>      0000 0100 = flags: 4
>>>>>>>          .... ..00 = capture interface: 0
>>>>>>>          .... .1.. = varying record length: 1
>>>>>>>          .... 0... = truncated: 0
>>>>>>>          ...0 .... = rx error: 0
>>>>>>>          ..0. .... = ds error: 0
>>>>>>>          00.. .... = reserved: 0
>>>>>>>      record length: 306
>>>>>>>      loss counter: 0
>>>>>>>      wire length: 290
>>>>>>> InfiniBand
>>>>>>>  Local Route Header
>>>>>>>      0110 .... = Virtual Lane: 0x06
>>>>>>>      .... 0000 = Link Version: 0
>>>>>>>      0110 .... = Service Level: 6
>>>>>>>      .... 00.. = Reserved (2 bits): 0
>>>>>>>      .... ..10 = Link Next Header: 0x02
>>>>>>>      Destination Local ID: 19
>>>>>>>      0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>      .... .000 0100 1000 = Packet Length: 72
>>>>>>>      Source Local ID: 16
>>>>>>>  Base Transport Header
>>>>>>>      Opcode: 100
>>>>>>>      1... .... = Solicited Event: True
>>>>>>>      .1.. .... = MigReq: True
>>>>>>>      ..00 .... = Pad Count: 0
>>>>>>>      .... 0000 = Header Version: 0
>>>>>>>      Partition Key: 65535
>>>>>>>      Reserved (8 bits): 0
>>>>>>>      Destination Queue Pair: 0x000001
>>>>>>>      0... .... = Acknowledge Request: False
>>>>>>>      .000 0000 = Reserved (7 bits): 0
>>>>>>>      Packet Sequence Number: 0
>>>>>>>  DETH - Datagram Extended Transport Header
>>>>>>>      Queue Key: 2147549184
>>>>>>>      Reserved (8 bits): 0
>>>>>>>      Source Queue Pair: 0x00380050
>>>>>>>  MAD Header - Common Management Datagram
>>>>>>>      Base Version: 0x01
>>>>>>>      Management Class: 0x03
>>>>>>>      Class Version: 0x02
>>>>>>>      Method: Get() (0x01)
>>>>>>>      Status: 0x0000
>>>>>>>      Class Specific: 0x0000
>>>>>>>      Transaction ID: 0x0010000f38005000
>>>>>>>      Attribute ID: 0x0035
>>>>>>>      Reserved: 0x0000
>>>>>>>      Attribute Modifier: 0x00000000
>>>>>>>      MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>>   Illegal RMPP Type (0)! 
>>>>>>>      RMPP Type: 0x00
>>>>>>>      RMPP Type: 0x00
>>>>>>>      0000 .... = R Resp Time: 0x00
>>>>>>>      .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>      RMPP Status:  (Normal) (0x00)
>>>>>>>      RMPP Data 1: 0x00000000
>>>>>>>      RMPP Data 2: 0x00000000
>>>>>>>  SMASubnAdmGet(PathRecord)
>>>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>>>      Attribute Offset: 0x0000
>>>>>>>      Reserved: 0x0000
>>>>>>>      Component Mask: 0x0000003000000000
>>>>>>>      Attribute (PathRecord)
>>>>>>>          PathRecord
>>>>>>>              DGID: :: (::)
>>>>>>>              SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>              DLID: 0x0000
>>>>>>>              SLID: 0x0000
>>>>>>>              0... .... = RawTraffic: 0x00
>>>>>>>              .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>              HopLimit: 0x00
>>>>>>>              TClass: 0x00
>>>>>>>              0... .... = Reversible: 0x00
>>>>>>>              .000 0000 = NumbPath: 0x00
>>>>>>>              P_Key: 0x0000
>>>>>>>              .... .... .... 0000 = SL: 0x0000
>>>>>>>              00.. .... = MTUSelector: 0x00
>>>>>>>              ..00 0000 = MTU: 0x00
>>>>>>>              00.. .... = RateSelector: 0x00
>>>>>>>              ..00 0000 = Rate: 0x00
>>>>>>>              00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>              ..00 0000 = PacketLifeTime: 0x00
>>>>>>>              Preference: 0x00
>>>>>>>  Variant CRC: 0xad4e
>>>>>>> ======================================================================================
>>>>>>
>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>> completions. That's based on your original email earlier this AM.
>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>>
>>>> Can ibdump be used to capture output on the SM port ?
>>>
>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>>
>>> Regards,
>>> Jens
>>>
>>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
>>
>> Are you referring to the fields after the SA AttributeOffset or
>> something else ?
> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
> I get on the OMPI side:
>     SMASubnAdmGetResp(PathRecord)
>         SM_Key (Verification Key): 0x0000000000000000
>         Attribute Offset: 0x0008
>         Reserved: 0x0000
>         Component Mask: 0x0000803000000000
>         Attribute (PathRecord)
>             PathRecord
>                 DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>                 SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>                 DLID: 0x0000
>                 SLID: 0x0000
>                 0... .... = RawTraffic: 0x00
>                 .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>                 HopLimit: 0xff
>                 TClass: 0x00
>                 0... .... = Reversible: 0x00
>                 .000 0011 = NumbPath: 0x03
>                 P_Key: 0x8486
>                 .... .... .... 0000 = SL: 0x0000
>                 00.. .... = MTUSelector: 0x00
>                 ..00 0000 = MTU: 0x00
>                 00.. .... = RateSelector: 0x00
>                 ..00 0000 = Rate: 0x00
>                 00.. .... = PacketLifeTimeSelector: 0x00
>                 ..00 0000 = PacketLifeTime: 0x00
>                 Preference: 0x00
> 
> But it should show (see the difference in SLID, DLID, SL which are now correct):
>     SMASubnAdmGetResp(PathRecord)
>         SM_Key (Verification Key): 0x0000000000000000
>         Attribute Offset: 0x0008
>         Reserved: 0x0000
>         Component Mask: 0x0000803000000000
>         Attribute (PathRecord)
>             PathRecord
>                 DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>                 SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>                 DLID: 0x0004
>                 SLID: 0x0008
>                 0... .... = RawTraffic: 0x00
>                 .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>                 HopLimit: 0x00
>                 TClass: 0x00
>                 1... .... = Reversible: 0x01
>                 .000 0000 = NumbPath: 0x00
>                 P_Key: 0xffff
>                 .... .... .... 0011 = SL: 0x0003
>                 10.. .... = MTUSelector: 0x02
>                 ..00 0100 = MTU: 0x04
>                 10.. .... = RateSelector: 0x02
>                 ..00 0110 = Rate: 0x06
>                 10.. .... = PacketLifeTimeSelector: 0x02
>                 ..01 0010 = PacketLifeTime: 0x12
>                 Preference: 0x00


I think everything after the AttributeOffset is off by 2 bytes. The DGID doesn't
look right to me (no fe80:: subnet prefix in front of the GUID).

-- Hal

> 
> Regards,
> Jens
> 
>>
>> -- Hal
>>
>>>>
>>>> -- Hal
>>>>
>>>>>>
>>>>>>>>
>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>
>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>
>>>>> Regards
>>>>> Jens
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> -- Hal
>>>>>>
>>>>>>> Regards
>>>>>>> Jens
>>>>>>>
>>>>>>>>
>>>>>>>> -- Hal
>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>>
>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>
>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>>
>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>>
>>>>>>>>> --------------------------------
>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>> --------------------------------
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>
>>>>>>> --------------------------------
>>>>>>> Dipl.-Math. Jens Domke
>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>> Global Scientific Information and Computing Center
>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>> --------------------------------
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>> --------------------------------
>>>>> Dipl.-Math. Jens Domke
>>>>> Researcher - Tokyo Institute of Technology
>>>>> Satoshi MATSUOKA Laboratory
>>>>> Global Scientific Information and Computing Center
>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>> Tokyo, 152-8550, JAPAN
>>>>> Tel/Fax: +81-3-5734-3876
>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>> --------------------------------
>>>>>
>>>>>
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> --------------------------------
>>> Dipl.-Math. Jens Domke
>>> Researcher - Tokyo Institute of Technology
>>> Satoshi MATSUOKA Laboratory
>>> Global Scientific Information and Computing Center
>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>> Tokyo, 152-8550, JAPAN
>>> Tel/Fax: +81-3-5734-3876
>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>> --------------------------------
>>>
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]                                 ` <50CDD114.2090706-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-16 14:59                                   ` Jens Domke
  2012-12-17  6:16                                     ` Jens Domke
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-16 14:59 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler


On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:

> On 12/16/2012 8:39 AM, Jens Domke wrote:
>> Hi,
>> 
>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>> 
>>> Hi,
>>> 
>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>> Hello Hal,
>>>> 
>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>> 
>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>> 
>>>>>>> Hi,
>>>>>>> 
>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>> Hello Hal,
>>>>>>>> 
>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>> 
>>>>>>>>> Hi again,
>>>>>>>>> 
>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>> Hello Hal,
>>>>>>>>>> 
>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>> 
>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>> 
>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>>>> 
>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>> 
>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>> 
>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>>> set for that port.
>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>>>> 
>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>> 
>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>>> 
>>>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>> 
>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>>> 
>>>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>>>> 
>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>> SubAdmGet of PatchRecord ?
>>>> 
>>>> No, the CompMask didn't had the SL bit and the SL was set to 0.
>>> 
>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>> valid one in the response.
>> Ok.
>>> 
>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>>>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>>>> But the routing engine can overwrite the requested SL before the reply is send.
>>>> 
>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>>>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
>>> 
>>> Are you sure the response doesn't leave the SA node or it's not received
>>> at the requester (OMPI node) ?
>> No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end.
>>> 
>>>> 
>>>>> 
>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>> 
>>>>> Good.
>>>>> 
>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>> 
>>>>> Depends. It may be that both SLs work but maybe not.
>>>>> 
>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>>>> What do you think?
>>>>> 
>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>> scenario that would fail with the query you are making if there's no SL
>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>> OMPI node so it's not even getting that far.
>>>> 
>>>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>>>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
>>> 
>>> If you're sure this response doesn't get out of the SA node, please
>>> contact Mellanox support with the details.
>> Ok, I can do this, if it turns out to be true.
>>> 
>>>>> 
>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>>>> 
>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>> 
>>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>>> 
>>>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>>>> incoming request was received on.
>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>>>   /* GS classes */
>>>>>>>>>>>>   umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>                     p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>                     p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>                     IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>> 
>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>>> 
>>>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>>> 
>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>>>> 
>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>> Yes, it was SL 6.
>>>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>>>> ======================================================================================
>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>> 
>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>> Frame Number: 785
>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>> [Frame is marked: False]
>>>>>>>> [Frame is ignored: False]
>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>> Extensible Record Format
>>>>>>>> [ERF Header]
>>>>>>>>     Timestamp: 0x50c99b587008bcf2
>>>>>>>>     [Header type]
>>>>>>>>         .001 0101 = type: INFINIBAND (21)
>>>>>>>>         0... .... = Extension header present: 0
>>>>>>>>     0000 0100 = flags: 4
>>>>>>>>         .... ..00 = capture interface: 0
>>>>>>>>         .... .1.. = varying record length: 1
>>>>>>>>         .... 0... = truncated: 0
>>>>>>>>         ...0 .... = rx error: 0
>>>>>>>>         ..0. .... = ds error: 0
>>>>>>>>         00.. .... = reserved: 0
>>>>>>>>     record length: 306
>>>>>>>>     loss counter: 0
>>>>>>>>     wire length: 290
>>>>>>>> InfiniBand
>>>>>>>> Local Route Header
>>>>>>>>     0110 .... = Virtual Lane: 0x06
>>>>>>>>     .... 0000 = Link Version: 0
>>>>>>>>     0110 .... = Service Level: 6
>>>>>>>>     .... 00.. = Reserved (2 bits): 0
>>>>>>>>     .... ..10 = Link Next Header: 0x02
>>>>>>>>     Destination Local ID: 19
>>>>>>>>     0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>     .... .000 0100 1000 = Packet Length: 72
>>>>>>>>     Source Local ID: 16
>>>>>>>> Base Transport Header
>>>>>>>>     Opcode: 100
>>>>>>>>     1... .... = Solicited Event: True
>>>>>>>>     .1.. .... = MigReq: True
>>>>>>>>     ..00 .... = Pad Count: 0
>>>>>>>>     .... 0000 = Header Version: 0
>>>>>>>>     Partition Key: 65535
>>>>>>>>     Reserved (8 bits): 0
>>>>>>>>     Destination Queue Pair: 0x000001
>>>>>>>>     0... .... = Acknowledge Request: False
>>>>>>>>     .000 0000 = Reserved (7 bits): 0
>>>>>>>>     Packet Sequence Number: 0
>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>     Queue Key: 2147549184
>>>>>>>>     Reserved (8 bits): 0
>>>>>>>>     Source Queue Pair: 0x00380050
>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>     Base Version: 0x01
>>>>>>>>     Management Class: 0x03
>>>>>>>>     Class Version: 0x02
>>>>>>>>     Method: Get() (0x01)
>>>>>>>>     Status: 0x0000
>>>>>>>>     Class Specific: 0x0000
>>>>>>>>     Transaction ID: 0x0010000f38005000
>>>>>>>>     Attribute ID: 0x0035
>>>>>>>>     Reserved: 0x0000
>>>>>>>>     Attribute Modifier: 0x00000000
>>>>>>>>     MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>>>  Illegal RMPP Type (0)! 
>>>>>>>>     RMPP Type: 0x00
>>>>>>>>     RMPP Type: 0x00
>>>>>>>>     0000 .... = R Resp Time: 0x00
>>>>>>>>     .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>     RMPP Status:  (Normal) (0x00)
>>>>>>>>     RMPP Data 1: 0x00000000
>>>>>>>>     RMPP Data 2: 0x00000000
>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>     SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>     Attribute Offset: 0x0000
>>>>>>>>     Reserved: 0x0000
>>>>>>>>     Component Mask: 0x0000003000000000
>>>>>>>>     Attribute (PathRecord)
>>>>>>>>         PathRecord
>>>>>>>>             DGID: :: (::)
>>>>>>>>             SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>             DLID: 0x0000
>>>>>>>>             SLID: 0x0000
>>>>>>>>             0... .... = RawTraffic: 0x00
>>>>>>>>             .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>             HopLimit: 0x00
>>>>>>>>             TClass: 0x00
>>>>>>>>             0... .... = Reversible: 0x00
>>>>>>>>             .000 0000 = NumbPath: 0x00
>>>>>>>>             P_Key: 0x0000
>>>>>>>>             .... .... .... 0000 = SL: 0x0000
>>>>>>>>             00.. .... = MTUSelector: 0x00
>>>>>>>>             ..00 0000 = MTU: 0x00
>>>>>>>>             00.. .... = RateSelector: 0x00
>>>>>>>>             ..00 0000 = Rate: 0x00
>>>>>>>>             00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>             ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>             Preference: 0x00
>>>>>>>> Variant CRC: 0xad4e
>>>>>>>> ======================================================================================
>>>>>>> 
>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>>> 
>>>>> Can ibdump be used to capture output on the SM port ?
>>>> 
>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>>> 
>>>> Regards,
>>>> Jens
>>>> 
>>>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>>>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
>>> 
>>> Are you referring to the fields after the SA AttributeOffset or
>>> something else ?
>> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
>> I get on the OMPI side:
>>    SMASubnAdmGetResp(PathRecord)
>>        SM_Key (Verification Key): 0x0000000000000000
>>        Attribute Offset: 0x0008
>>        Reserved: 0x0000
>>        Component Mask: 0x0000803000000000
>>        Attribute (PathRecord)
>>            PathRecord
>>                DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>                SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>                DLID: 0x0000
>>                SLID: 0x0000
>>                0... .... = RawTraffic: 0x00
>>                .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>                HopLimit: 0xff
>>                TClass: 0x00
>>                0... .... = Reversible: 0x00
>>                .000 0011 = NumbPath: 0x03
>>                P_Key: 0x8486
>>                .... .... .... 0000 = SL: 0x0000
>>                00.. .... = MTUSelector: 0x00
>>                ..00 0000 = MTU: 0x00
>>                00.. .... = RateSelector: 0x00
>>                ..00 0000 = Rate: 0x00
>>                00.. .... = PacketLifeTimeSelector: 0x00
>>                ..00 0000 = PacketLifeTime: 0x00
>>                Preference: 0x00
>> 
>> But it should show (see the difference in SLID, DLID, SL which are now correct):
>>    SMASubnAdmGetResp(PathRecord)
>>        SM_Key (Verification Key): 0x0000000000000000
>>        Attribute Offset: 0x0008
>>        Reserved: 0x0000
>>        Component Mask: 0x0000803000000000
>>        Attribute (PathRecord)
>>            PathRecord
>>                DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>                SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>                DLID: 0x0004
>>                SLID: 0x0008
>>                0... .... = RawTraffic: 0x00
>>                .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>                HopLimit: 0x00
>>                TClass: 0x00
>>                1... .... = Reversible: 0x01
>>                .000 0000 = NumbPath: 0x00
>>                P_Key: 0xffff
>>                .... .... .... 0011 = SL: 0x0003
>>                10.. .... = MTUSelector: 0x02
>>                ..00 0100 = MTU: 0x04
>>                10.. .... = RateSelector: 0x02
>>                ..00 0110 = Rate: 0x06
>>                10.. .... = PacketLifeTimeSelector: 0x02
>>                ..01 0010 = PacketLifeTime: 0x12
>>                Preference: 0x00
> 
> 
> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
> look right to me (no subnet prefix fe80:: in front of GUID).

Yes, I made a small mistake with the hex editor: I started the shift after the subnet prefix.
Sorry for the confusion.

Thank you for the hint about smpquery and saquery; I will check them tomorrow.

Jens

> 
> -- Hal
> 
>> 
>> Regards,
>> Jens
>> 
>>> 
>>> -- Hal
>>> 
>>>>> 
>>>>> -- Hal
>>>>> 
>>>>>>> 
>>>>>>>>> 
>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>> 
>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>> 
>>>>>> Regards
>>>>>> Jens
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> -- Hal
>>>>>>> 
>>>>>>>> Regards
>>>>>>>> Jens
>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> -- Hal
>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>>> 
>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>> 
>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>>> 
>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>> 
>>>>>>>>>> Regards
>>>>>>>>>> Jens
>>>>>>>>>> 
>>>>>>>>>> --------------------------------
>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>>> --------------------------------
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>> 
>>>>>>>> --------------------------------
>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>> --------------------------------
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>> 
>>>>>> --------------------------------
>>>>>> Dipl.-Math. Jens Domke
>>>>>> Researcher - Tokyo Institute of Technology
>>>>>> Satoshi MATSUOKA Laboratory
>>>>>> Global Scientific Information and Computing Center
>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>> Tokyo, 152-8550, JAPAN
>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>> --------------------------------
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>> 
>>>> --------------------------------
>>>> Dipl.-Math. Jens Domke
>>>> Researcher - Tokyo Institute of Technology
>>>> Satoshi MATSUOKA Laboratory
>>>> Global Scientific Information and Computing Center
>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>> Tokyo, 152-8550, JAPAN
>>>> Tel/Fax: +81-3-5734-3876
>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>> --------------------------------
>>>> 
>>>> 
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> 
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
  2012-12-16 14:59                                   ` Jens Domke
@ 2012-12-17  6:16                                     ` Jens Domke
  2012-12-17 12:04                                       ` Hal Rosenstock
  0 siblings, 1 reply; 18+ messages in thread
From: Jens Domke @ 2012-12-17  6:16 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

I have checked the smpquery and saquery commands today.

The smpquery SL2VL and PI commands for the opensm port work fine, and I get the expected results:
======================================================
# SL2VL table: Lid 19
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
======================================================
# Port info: Lid 19 port 0
Mkey:............................<not displayed>
GidPrefix:.......................0xfe80000000000000
Lid:.............................19
SMLid:...........................19
CapMask:.........................0x251086a
                                IsSM
                                IsTrapSupported
                                IsAutomaticMigrationSupported
                                IsSLMappingSupported
                                IsSystemImageGUIDsupported
                                IsCommunicatonManagementSupported
                                IsVendorClassSupported
                                IsCapabilityMaskNoticeSupported
                                IsClientRegistrationSupported
DiagCode:........................0x0000
MkeyLeasePeriod:.................0
LocalPort:.......................1
LinkWidthEnabled:................1X or 4X
LinkWidthSupported:..............1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-7
InitType:........................0x00
VLHighLimit:.....................0
VLArbHighCap:....................8
VLArbLowCap:.....................8
InitReply:.......................0x00
MtuCap:..........................2048
VLStallCount:....................0
HoqLife:.........................31
OperVLs:.........................VL0-7
PartEnforceInb:..................0
PartEnforceOutb:.................0
FilterRawInb:....................0
FilterRawOutb:...................0
MkeyViolations:..................0
PkeyViolations:..................0
QkeyViolations:..................0
GuidCap:.........................32
ClientReregister:................0
McastPkeyTrapSuppressionEnabled:.0
SubnetTimeout:...................18
RespTimeVal:.....................16
LocalPhysErr:....................8
OverrunErr:......................8
MaxCreditHint:...................0
RoundTrip:.......................0
CapabilityMask2:.................0x0000
LinkSpeedExtActive:..............No Extended Speed
LinkSpeedExtSupported:...........0
LinkSpeedExtEnabled:.............0
======================================================


The problem is with the saquery commands on other nodes.
In most cases the execution fails, and the node shows the same behaviour as the OpenSM node when it tries to send on SL>0. The PathRequest packet does not arrive at the node running OpenSM (checked with ibdump; a sketch of the capture command follows the saquery output below). At some point during the execution the saquery binary hangs, the kernel log indicates errors, and the only option is to reboot. 
This is the output I see from saquery:
======================================================
saquery -P --src-to-dst 4:8
ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out

Query SA failed: Connection timed out
======================================================
(In really rare cases I do get the PathRequest back and see it in the dump, but the saquery binary stalls afterwards, too.)
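For completeness, this is roughly how I capture on a port to check that (the mlx4_0 device name and the exact ibdump options are only an example and may differ on your setup):
======================================================
ibdump -d mlx4_0 -i 1 -w /tmp/saquery_port.pcap
======================================================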


I did some debugging with gdb again and stepped through the saquery code.
When I change the SL to 0 in the addr vector of the MAD right before umad_send is called, then everything works.
So saquery on the compute nodes shows the same behaviour as opensm with respect to the SL value passed to umad_send.
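For reference, the manual workaround looks roughly like this in code (a minimal sketch against libibumad; the helper name and its arguments are made up for illustration, only umad_set_addr_net() and umad_send() are the real API calls):

#include <stdint.h>
#include <stdio.h>
#include <infiniband/umad.h>

/* Force the reply onto SL 0 right before it is handed to umad_send(),
 * mirroring what I currently do by hand in gdb. dlid, rqp and qkey are
 * expected in network byte order, as in osm_vendor_send(). */
static int send_reply_on_sl0(int portid, int agentid, void *umad, int length,
                             uint16_t dlid, uint32_t rqp, uint32_t qkey)
{
        umad_set_addr_net(umad, dlid, rqp, 0 /* SL */, qkey);

        if (umad_send(portid, agentid, umad, length,
                      0 /* timeout_ms */, 3 /* retries */) < 0) {
                perror("umad_send");
                return -1;
        }
        return 0;
}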


Finally, I tried running MinHop instead of DFSSSP and specified sm_sl 1 in the opensm config file.
Sadly, this configuration results in the same crashes of the saquery commands.
For the MinHop runs I also used a different SL2VL mapping, just to be sure that there is no problem with VL>0, so that every SL travels on VL=0 (config excerpt after the table):
======================================================
#                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
======================================================
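The relevant opensm config excerpt for this MinHop test looked roughly like this (reconstructed sketch; option names as in the stock opensm.conf, and the qos_sl2vl line is what produces the all-zero table above):
======================================================
routing_engine minhop
sm_sl 1
qos_sl2vl 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
======================================================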


Regards,
Jens


On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:

> 
> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
> 
>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>> Hi,
>>> 
>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>> 
>>>> Hi,
>>>> 
>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>> Hello Hal,
>>>>> 
>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>> 
>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>> 
>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>> 
>>>>>>>>>> Hi again,
>>>>>>>>>> 
>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>> Hello Hal,
>>>>>>>>>>> 
>>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>>> 
>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>> 
>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>>>>> 
>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>> 
>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>> 
>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>>>> set for that port.
>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>>>>> 
>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>> 
>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>>>> 
>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>>> 
>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>>>> 
>>>>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>>>>> 
>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>>> SubAdmGet of PatchRecord ?
>>>>> 
>>>>> No, the CompMask didn't had the SL bit and the SL was set to 0.
>>>> 
>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>> valid one in the response.
>>> Ok.
>>>> 
>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>>>>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>>>>> But the routing engine can overwrite the requested SL before the reply is send.
>>>>> 
>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>>>>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
>>>> 
>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>> at the requester (OMPI node) ?
>>> No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end.
>>>> 
>>>>> 
>>>>>> 
>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>> 
>>>>>> Good.
>>>>>> 
>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>> 
>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>> 
>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>>>>> What do you think?
>>>>>> 
>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>> scenario that would fail with the query you are making if there's no SL
>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>> OMPI node so it's not even getting that far.
>>>>> 
>>>>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>>>>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
>>>> 
>>>> If you're sure this response doesn't get out of the SA node, please
>>>> contact Mellanox support with the details.
>>> Ok, I can do this, if it turns out to be true.
>>>> 
>>>>>> 
>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>> 
>>>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>>>> 
>>>>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>>>>> incoming request was received on.
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>>>>  /* GS classes */
>>>>>>>>>>>>>  umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>                    IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>> 
>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>>>> 
>>>>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>>>> 
>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>>>>> 
>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>> Yes, it was SL 6.
>>>>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>>>>> ======================================================================================
>>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>> 
>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>> Frame Number: 785
>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>> [Frame is marked: False]
>>>>>>>>> [Frame is ignored: False]
>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>> Extensible Record Format
>>>>>>>>> [ERF Header]
>>>>>>>>>    Timestamp: 0x50c99b587008bcf2
>>>>>>>>>    [Header type]
>>>>>>>>>        .001 0101 = type: INFINIBAND (21)
>>>>>>>>>        0... .... = Extension header present: 0
>>>>>>>>>    0000 0100 = flags: 4
>>>>>>>>>        .... ..00 = capture interface: 0
>>>>>>>>>        .... .1.. = varying record length: 1
>>>>>>>>>        .... 0... = truncated: 0
>>>>>>>>>        ...0 .... = rx error: 0
>>>>>>>>>        ..0. .... = ds error: 0
>>>>>>>>>        00.. .... = reserved: 0
>>>>>>>>>    record length: 306
>>>>>>>>>    loss counter: 0
>>>>>>>>>    wire length: 290
>>>>>>>>> InfiniBand
>>>>>>>>> Local Route Header
>>>>>>>>>    0110 .... = Virtual Lane: 0x06
>>>>>>>>>    .... 0000 = Link Version: 0
>>>>>>>>>    0110 .... = Service Level: 6
>>>>>>>>>    .... 00.. = Reserved (2 bits): 0
>>>>>>>>>    .... ..10 = Link Next Header: 0x02
>>>>>>>>>    Destination Local ID: 19
>>>>>>>>>    0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>    .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>    Source Local ID: 16
>>>>>>>>> Base Transport Header
>>>>>>>>>    Opcode: 100
>>>>>>>>>    1... .... = Solicited Event: True
>>>>>>>>>    .1.. .... = MigReq: True
>>>>>>>>>    ..00 .... = Pad Count: 0
>>>>>>>>>    .... 0000 = Header Version: 0
>>>>>>>>>    Partition Key: 65535
>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>    Destination Queue Pair: 0x000001
>>>>>>>>>    0... .... = Acknowledge Request: False
>>>>>>>>>    .000 0000 = Reserved (7 bits): 0
>>>>>>>>>    Packet Sequence Number: 0
>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>>    Queue Key: 2147549184
>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>    Source Queue Pair: 0x00380050
>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>>    Base Version: 0x01
>>>>>>>>>    Management Class: 0x03
>>>>>>>>>    Class Version: 0x02
>>>>>>>>>    Method: Get() (0x01)
>>>>>>>>>    Status: 0x0000
>>>>>>>>>    Class Specific: 0x0000
>>>>>>>>>    Transaction ID: 0x0010000f38005000
>>>>>>>>>    Attribute ID: 0x0035
>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>    Attribute Modifier: 0x00000000
>>>>>>>>>    MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>>>> Illegal RMPP Type (0)! 
>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>    0000 .... = R Resp Time: 0x00
>>>>>>>>>    .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>    RMPP Status:  (Normal) (0x00)
>>>>>>>>>    RMPP Data 1: 0x00000000
>>>>>>>>>    RMPP Data 2: 0x00000000
>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>>    SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>    Attribute Offset: 0x0000
>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>    Component Mask: 0x0000003000000000
>>>>>>>>>    Attribute (PathRecord)
>>>>>>>>>        PathRecord
>>>>>>>>>            DGID: :: (::)
>>>>>>>>>            SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>            DLID: 0x0000
>>>>>>>>>            SLID: 0x0000
>>>>>>>>>            0... .... = RawTraffic: 0x00
>>>>>>>>>            .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>            HopLimit: 0x00
>>>>>>>>>            TClass: 0x00
>>>>>>>>>            0... .... = Reversible: 0x00
>>>>>>>>>            .000 0000 = NumbPath: 0x00
>>>>>>>>>            P_Key: 0x0000
>>>>>>>>>            .... .... .... 0000 = SL: 0x0000
>>>>>>>>>            00.. .... = MTUSelector: 0x00
>>>>>>>>>            ..00 0000 = MTU: 0x00
>>>>>>>>>            00.. .... = RateSelector: 0x00
>>>>>>>>>            ..00 0000 = Rate: 0x00
>>>>>>>>>            00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>            ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>            Preference: 0x00
>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>> ======================================================================================
>>>>>>>> 
>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>>>> 
>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>> 
>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>>>> 
>>>>> Regards,
>>>>> Jens
>>>>> 
>>>>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>>>>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
>>>> 
>>>> Are you referring to the fields after the SA AttributeOffset or
>>>> something else ?
>>> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
>>> I get on the OMPI side:
>>>   SMASubnAdmGetResp(PathRecord)
>>>       SM_Key (Verification Key): 0x0000000000000000
>>>       Attribute Offset: 0x0008
>>>       Reserved: 0x0000
>>>       Component Mask: 0x0000803000000000
>>>       Attribute (PathRecord)
>>>           PathRecord
>>>               DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>               SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>               DLID: 0x0000
>>>               SLID: 0x0000
>>>               0... .... = RawTraffic: 0x00
>>>               .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>               HopLimit: 0xff
>>>               TClass: 0x00
>>>               0... .... = Reversible: 0x00
>>>               .000 0011 = NumbPath: 0x03
>>>               P_Key: 0x8486
>>>               .... .... .... 0000 = SL: 0x0000
>>>               00.. .... = MTUSelector: 0x00
>>>               ..00 0000 = MTU: 0x00
>>>               00.. .... = RateSelector: 0x00
>>>               ..00 0000 = Rate: 0x00
>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>               ..00 0000 = PacketLifeTime: 0x00
>>>               Preference: 0x00
>>> 
>>> But it should show (see the difference in SLID, DLID, SL which are now correct):
>>>   SMASubnAdmGetResp(PathRecord)
>>>       SM_Key (Verification Key): 0x0000000000000000
>>>       Attribute Offset: 0x0008
>>>       Reserved: 0x0000
>>>       Component Mask: 0x0000803000000000
>>>       Attribute (PathRecord)
>>>           PathRecord
>>>               DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>               SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>               DLID: 0x0004
>>>               SLID: 0x0008
>>>               0... .... = RawTraffic: 0x00
>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>               HopLimit: 0x00
>>>               TClass: 0x00
>>>               1... .... = Reversible: 0x01
>>>               .000 0000 = NumbPath: 0x00
>>>               P_Key: 0xffff
>>>               .... .... .... 0011 = SL: 0x0003
>>>               10.. .... = MTUSelector: 0x02
>>>               ..00 0100 = MTU: 0x04
>>>               10.. .... = RateSelector: 0x02
>>>               ..00 0110 = Rate: 0x06
>>>               10.. .... = PacketLifeTimeSelector: 0x02
>>>               ..01 0010 = PacketLifeTime: 0x12
>>>               Preference: 0x00
>> 
>> 
>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>> look right to me (no subnet prefix fe80:: in front of GUID).
> 
> Yes, I made a small mistake with the hexeditor. I started the shift after the subnet prefix.
> Sorry for the confusion.
> 
> Thank you for the hint with smpquery and saquery, I will check that tomorrow.
> 
> Jens
> 
>> 
>> -- Hal
>> 
>>> 
>>> Regards,
>>> Jens
>>> 
>>>> 
>>>> -- Hal
>>>> 
>>>>>> 
>>>>>> -- Hal
>>>>>> 
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>> 
>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>>> 
>>>>>>> Regards
>>>>>>> Jens
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>> -- Hal
>>>>>>>> 
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- Hal
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>>>> 
>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>>> In the OpenSM log, only that it was received, how the request looks like, and that it was send back.
>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>> 
>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>>>> 
>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> Jens
>>>>>>>>>>> 

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
  2012-12-17  6:16                                     ` Jens Domke
@ 2012-12-17 12:04                                       ` Hal Rosenstock
       [not found]                                         ` <50CF0A33.1030809-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 18+ messages in thread
From: Hal Rosenstock @ 2012-12-17 12:04 UTC (permalink / raw)
  To: Jens Domke; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hi,

On 12/17/2012 1:16 AM, Jens Domke wrote:
> Hello Hal,
> 
> I have checked the smpquery and saquery command today.
> 
> The smpquery SL2VL and PI commands for the opensm port work fine, and I get the expected results:
> ======================================================
> # SL2VL table: Lid 19
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
> ======================================================
> # Port info: Lid 19 port 0
> Mkey:............................<not displayed>
> GidPrefix:.......................0xfe80000000000000
> Lid:.............................19
> SMLid:...........................19
> CapMask:.........................0x251086a
>                                 IsSM
>                                 IsTrapSupported
>                                 IsAutomaticMigrationSupported
>                                 IsSLMappingSupported
>                                 IsSystemImageGUIDsupported
>                                 IsCommunicatonManagementSupported
>                                 IsVendorClassSupported
>                                 IsCapabilityMaskNoticeSupported
>                                 IsClientRegistrationSupported
> DiagCode:........................0x0000
> MkeyLeasePeriod:.................0
> LocalPort:.......................1
> LinkWidthEnabled:................1X or 4X
> LinkWidthSupported:..............1X or 4X
> LinkWidthActive:.................4X
> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
> LinkState:.......................Active
> PhysLinkState:...................LinkUp
> LinkDownDefState:................Polling
> ProtectBits:.....................0
> LMC:.............................0
> LinkSpeedActive:.................5.0 Gbps
> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
> NeighborMTU:.....................2048
> SMSL:............................0
> VLCap:...........................VL0-7
> InitType:........................0x00
> VLHighLimit:.....................0
> VLArbHighCap:....................8
> VLArbLowCap:.....................8
> InitReply:.......................0x00
> MtuCap:..........................2048
> VLStallCount:....................0
> HoqLife:.........................31
> OperVLs:.........................VL0-7
> PartEnforceInb:..................0
> PartEnforceOutb:.................0
> FilterRawInb:....................0
> FilterRawOutb:...................0
> MkeyViolations:..................0
> PkeyViolations:..................0
> QkeyViolations:..................0
> GuidCap:.........................32
> ClientReregister:................0
> McastPkeyTrapSuppressionEnabled:.0
> SubnetTimeout:...................18
> RespTimeVal:.....................16
> LocalPhysErr:....................8
> OverrunErr:......................8
> MaxCreditHint:...................0
> RoundTrip:.......................0
> CapabilityMask2:.................0x0000
> LinkSpeedExtActive:..............No Extended Speed
> LinkSpeedExtSupported:...........0
> LinkSpeedExtEnabled:.............0
> ======================================================
> 
> 
> The problem are the saquery commands on other nodes.
> In most cases the executions fails, and the node shows the same behaviour like the OpenSM node, when it trys to send on SL>0. The PathRequest paket does not arrive at the node with the running OpenSM (checked with ibdumb). At some point of the execution the saquery binary hangs, the kernel log indicates errors and the only option is to reboot. 
> This is the output I see for the saquery:
> ======================================================
> saquery -P --src-to-dst 4:8
> ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
> 
> Query SA failed: Connection timed out
> ======================================================
> (In really rar cases I get the PathRequest back and see the dump, but the saquery binary stalls afterwards, too.)
> 
> 
> I did some debugging with gdb again, and stepped thru the saquery code.
> When I change the SL to 0 in the addr vector of the MAD right before umad_send is called, then everthing works.
> So, the saquery on the compute nodes shows the same behaviour as the opensm with respect to the SL value for umad_send.
> 
> 
> At the end I tried to run MinHop instead of DFSSSP, and specified sm_sl 1 in the config file of opensm.
> Sadly, this configuration results in the same crashes of the saquery commands.
> For the runs with MinHop I used also a different SL2VL mapping, just to be sure, that there is no problem with VL>0 and every SL travels on VL=0:
> ======================================================
> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
> ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
> ======================================================

Non-QoS routing algorithms still need -Q, otherwise the full range of QoS
is not available. Was OpenSM started with -Q for this test?
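For the MinHop run that would be something like the following invocation (treat it as a sketch, the long-option spellings vary between versions), or equivalently qos TRUE in the options file:

opensm -Q -R minhop -F /etc/opensm/opensm.conf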

-- Hal
> 
> Regards,
> Jens
> 
> 
> On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
> 
>>
>> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>>
>>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>>> Hi,
>>>>
>>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>>> Hello Hal,
>>>>>>
>>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>>> Hello Hal,
>>>>>>>>
>>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>>> Hello Hal,
>>>>>>>>>>
>>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi again,
>>>>>>>>>>>
>>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>>
>>>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>>>>
>>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
>>>>>>>>>>>>>
>>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>>>>> set for that port.
>>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>>>>>> In fact OpenMPI sets everthing to 0 except for slid and dlid.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>>>>>
>>>>>>>>>>> With DFSSSP are all SLs same from source port to get to any destination ?
>>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>>>>
>>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>>> True. But i don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>>>>>
>>>>>>>> I just read the IB Specs and it says, that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>>>>>> So, its most likely, that there is a mismatch in the way how OMPI does the setup of the PathRequest and the way how the SA does build the respond packet.
>>>>>>>> OMPI always specifies SL=0 (lets say SL_a) inside of the PathRequest packet, 
>>>>>>>
>>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>>>> SubAdmGet of PatchRecord ?
>>>>>>
>>>>>> No, the CompMask didn't had the SL bit and the SL was set to 0.
>>>>>
>>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>>> valid one in the response.
>>>> Ok.
>>>>>
>>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>>>>>> The SA just treats the SL in the PathRequest as a "I would like to use this SL" in case the SL bit is set.
>>>>>> But the routing engine can overwrite the requested SL before the reply is send.
>>>>>>
>>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>>>>>> Sadly, the reply send by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA, the paket is able to leave the node and reaches the OMPI process.
>>>>>
>>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>>> at the requester (OMPI node) ?
>>>> No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD pakets which leave a port, it only shows the pakets when they are received on the other end.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>>
>>>>>>> Good.
>>>>>>>
>>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>>
>>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>>
>>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI, that it does not specify the SL within the PathRequest in a appropriate way (which would be a SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL, on which the PathRequest packet was send, and not the SL specified within the packet.
>>>>>>>> What do you think?
>>>>>>>
>>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>>> scenario that would fail with the query you are making if there's no SL
>>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>>> OMPI node so it's not even getting that far.
>>>>>>
>>>>>> Yes, exactly. So, do you have an idea why the response hands in the SA node?
>>>>>> I have no inside of the underlying layer (kernel driver and fireware). Maybe there are some implementations, which prevent the SA from sending MADs back on SL>0?
>>>>>
>>>>> If you're sure this response doesn't get out of the SA node, please
>>>>> contact Mellanox support with the details.
>>>> Ok, I can do this, if it turns out to be true.
>>>>>
>>>>>>>
>>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>>>>>> Maybe, with this change the packets of the SA will reach the OMPI process on a SL>0.
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>>
>>>>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>>>>>
>>>>>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>>>>>> incoming request was received on.
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>>>>>  /* GS classes */
>>>>>>>>>>>>>>  umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>>                    p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>>                    IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>>
>>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>>> And I think I was some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …).
>>>>>>>>>>>> SL=6 indicates, that the MPI process was sending the request on SL 6.
>>>>>>>>>>>
>>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>>> Yes, it was SL 6.
>>>>>>>>>> Here is a content of a similar packet which was received by the SA. I have used ibdump on the port where the OpenSM was running:
>>>>>>>>>> ======================================================================================
>>>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>>
>>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>>> Frame Number: 785
>>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>>> [Frame is marked: False]
>>>>>>>>>> [Frame is ignored: False]
>>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>>> Extensible Record Format
>>>>>>>>>> [ERF Header]
>>>>>>>>>>    Timestamp: 0x50c99b587008bcf2
>>>>>>>>>>    [Header type]
>>>>>>>>>>        .001 0101 = type: INFINIBAND (21)
>>>>>>>>>>        0... .... = Extension header present: 0
>>>>>>>>>>    0000 0100 = flags: 4
>>>>>>>>>>        .... ..00 = capture interface: 0
>>>>>>>>>>        .... .1.. = varying record length: 1
>>>>>>>>>>        .... 0... = truncated: 0
>>>>>>>>>>        ...0 .... = rx error: 0
>>>>>>>>>>        ..0. .... = ds error: 0
>>>>>>>>>>        00.. .... = reserved: 0
>>>>>>>>>>    record length: 306
>>>>>>>>>>    loss counter: 0
>>>>>>>>>>    wire length: 290
>>>>>>>>>> InfiniBand
>>>>>>>>>> Local Route Header
>>>>>>>>>>    0110 .... = Virtual Lane: 0x06
>>>>>>>>>>    .... 0000 = Link Version: 0
>>>>>>>>>>    0110 .... = Service Level: 6
>>>>>>>>>>    .... 00.. = Reserved (2 bits): 0
>>>>>>>>>>    .... ..10 = Link Next Header: 0x02
>>>>>>>>>>    Destination Local ID: 19
>>>>>>>>>>    0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>>    .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>>    Source Local ID: 16
>>>>>>>>>> Base Transport Header
>>>>>>>>>>    Opcode: 100
>>>>>>>>>>    1... .... = Solicited Event: True
>>>>>>>>>>    .1.. .... = MigReq: True
>>>>>>>>>>    ..00 .... = Pad Count: 0
>>>>>>>>>>    .... 0000 = Header Version: 0
>>>>>>>>>>    Partition Key: 65535
>>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>>    Destination Queue Pair: 0x000001
>>>>>>>>>>    0... .... = Acknowledge Request: False
>>>>>>>>>>    .000 0000 = Reserved (7 bits): 0
>>>>>>>>>>    Packet Sequence Number: 0
>>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>>>    Queue Key: 2147549184
>>>>>>>>>>    Reserved (8 bits): 0
>>>>>>>>>>    Source Queue Pair: 0x00380050
>>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>>>    Base Version: 0x01
>>>>>>>>>>    Management Class: 0x03
>>>>>>>>>>    Class Version: 0x02
>>>>>>>>>>    Method: Get() (0x01)
>>>>>>>>>>    Status: 0x0000
>>>>>>>>>>    Class Specific: 0x0000
>>>>>>>>>>    Transaction ID: 0x0010000f38005000
>>>>>>>>>>    Attribute ID: 0x0035
>>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>>    Attribute Modifier: 0x00000000
>>>>>>>>>>    MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>>>>> Illegal RMPP Type (0)! 
>>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>>    RMPP Type: 0x00
>>>>>>>>>>    0000 .... = R Resp Time: 0x00
>>>>>>>>>>    .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>>    RMPP Status:  (Normal) (0x00)
>>>>>>>>>>    RMPP Data 1: 0x00000000
>>>>>>>>>>    RMPP Data 2: 0x00000000
>>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>>>    SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>>    Attribute Offset: 0x0000
>>>>>>>>>>    Reserved: 0x0000
>>>>>>>>>>    Component Mask: 0x0000003000000000
>>>>>>>>>>    Attribute (PathRecord)
>>>>>>>>>>        PathRecord
>>>>>>>>>>            DGID: :: (::)
>>>>>>>>>>            SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>>            DLID: 0x0000
>>>>>>>>>>            SLID: 0x0000
>>>>>>>>>>            0... .... = RawTraffic: 0x00
>>>>>>>>>>            .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>>            HopLimit: 0x00
>>>>>>>>>>            TClass: 0x00
>>>>>>>>>>            0... .... = Reversible: 0x00
>>>>>>>>>>            .000 0000 = NumbPath: 0x00
>>>>>>>>>>            P_Key: 0x0000
>>>>>>>>>>            .... .... .... 0000 = SL: 0x0000
>>>>>>>>>>            00.. .... = MTUSelector: 0x00
>>>>>>>>>>            ..00 0000 = MTU: 0x00
>>>>>>>>>>            00.. .... = RateSelector: 0x00
>>>>>>>>>>            ..00 0000 = Rate: 0x00
>>>>>>>>>>            00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>>            ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>>            Preference: 0x00
>>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>>> ======================================================================================
>>>>>>>>>
>>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>>>>>
>>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>>
>>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>>>>>
>>>>>> Regards,
>>>>>> Jens
>>>>>>
>>>>>> PS: I have seen a small bug. Not sure if its a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly be wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>>>>>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
>>>>>
>>>>> Are you referring to the fields after the SA AttributeOffset or
>>>>> something else ?
>>>> Yes, after the SMASubnAdmGet Attribute Offset. Here an example:
>>>> I get on the OMPI side:
>>>>   SMASubnAdmGetResp(PathRecord)
>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>       Attribute Offset: 0x0008
>>>>       Reserved: 0x0000
>>>>       Component Mask: 0x0000803000000000
>>>>       Attribute (PathRecord)
>>>>           PathRecord
>>>>               DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>>               SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>>               DLID: 0x0000
>>>>               SLID: 0x0000
>>>>               0... .... = RawTraffic: 0x00
>>>>               .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>>               HopLimit: 0xff
>>>>               TClass: 0x00
>>>>               0... .... = Reversible: 0x00
>>>>               .000 0011 = NumbPath: 0x03
>>>>               P_Key: 0x8486
>>>>               .... .... .... 0000 = SL: 0x0000
>>>>               00.. .... = MTUSelector: 0x00
>>>>               ..00 0000 = MTU: 0x00
>>>>               00.. .... = RateSelector: 0x00
>>>>               ..00 0000 = Rate: 0x00
>>>>               00.. .... = PacketLifeTimeSelector: 0x00
>>>>               ..00 0000 = PacketLifeTime: 0x00
>>>>               Preference: 0x00
>>>>
>>>> But it should show (see the difference in SLID, DLID, SL which are now correct):
>>>>   SMASubnAdmGetResp(PathRecord)
>>>>       SM_Key (Verification Key): 0x0000000000000000
>>>>       Attribute Offset: 0x0008
>>>>       Reserved: 0x0000
>>>>       Component Mask: 0x0000803000000000
>>>>       Attribute (PathRecord)
>>>>           PathRecord
>>>>               DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>>               SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>>               DLID: 0x0004
>>>>               SLID: 0x0008
>>>>               0... .... = RawTraffic: 0x00
>>>>               .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>               HopLimit: 0x00
>>>>               TClass: 0x00
>>>>               1... .... = Reversible: 0x01
>>>>               .000 0000 = NumbPath: 0x00
>>>>               P_Key: 0xffff
>>>>               .... .... .... 0011 = SL: 0x0003
>>>>               10.. .... = MTUSelector: 0x02
>>>>               ..00 0100 = MTU: 0x04
>>>>               10.. .... = RateSelector: 0x02
>>>>               ..00 0110 = Rate: 0x06
>>>>               10.. .... = PacketLifeTimeSelector: 0x02
>>>>               ..01 0010 = PacketLifeTime: 0x12
>>>>               Preference: 0x00
>>>
>>>
>>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>>> look right to me (no subnet prefix fe80:: in front of GUID).
>>
>> Yes, I made a small mistake with the hexeditor. I started the shift after the subnet prefix.
>> Sorry for the confusion.
>>
>> Thank you for the hint with smpquery and saquery, I will check that tomorrow.
>>
>> Jens
>>
>>>
>>> -- Hal
>>>
>>>>
>>>> Regards,
>>>> Jens
>>>>
>>>>>
>>>>> -- Hal
>>>>>
>>>>>>>
>>>>>>> -- Hal
>>>>>>>
>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>>
>>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>> Jens
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> -- Hal
>>>>>>>>>
>>>>>>>>>> Regards
>>>>>>>>>> Jens
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> -- Hal
>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>>>> In the OpenSM log, only that it was received, what the request looks like, and that it was sent back.
>>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards
>>>>>>>>>>>> Jens
>>>>>>>>>>>>
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>> --------------------------------
>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>>> --------------------------------
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>> --------------------------------
>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>> --------------------------------
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --------------------------------
>>>>>> Dipl.-Math. Jens Domke
>>>>>> Researcher - Tokyo Institute of Technology
>>>>>> Satoshi MATSUOKA Laboratory
>>>>>> Global Scientific Information and Computing Center
>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>> Tokyo, 152-8550, JAPAN
>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>> --------------------------------
>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> --------------------------------
> Dipl.-Math. Jens Domke
> Researcher - Tokyo Institute of Technology
> Satoshi MATSUOKA Laboratory
> Global Scientific Information and Computing Center
> 2-12-1-E2-7 Ookayama, Meguro-ku, 
> Tokyo, 152-8550, JAPAN
> Tel/Fax: +81-3-5734-3876
> E-Mail: domke.j.aa@m.titech.ac.jp
> --------------------------------
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: umad_send with service level higher than 0 does not work
       [not found]                                         ` <50CF0A33.1030809-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2012-12-18  2:26                                           ` Jens Domke
  0 siblings, 0 replies; 18+ messages in thread
From: Jens Domke @ 2012-12-18  2:26 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, Torsten Hoefler

Hello Hal,

On Dec 17, 2012, at 9:04 PM, Hal Rosenstock wrote:

> Hi,
> 
> On 12/17/2012 1:16 AM, Jens Domke wrote:
>> Hello Hal,
>> 
>> I have checked the smpquery and saquery commands today.
>> 
>> The smpquery SL2VL and PI commands for the opensm port work fine, and I get the expected results:
>> ======================================================
>> # SL2VL table: Lid 19
>> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in  0, out  0: | 0| 1| 2| 3| 4| 5| 6| 7| 0| 1| 2| 3| 4| 5| 6| 7|
>> ======================================================
>> # Port info: Lid 19 port 0
>> Mkey:............................<not displayed>
>> GidPrefix:.......................0xfe80000000000000
>> Lid:.............................19
>> SMLid:...........................19
>> CapMask:.........................0x251086a
>>                                IsSM
>>                                IsTrapSupported
>>                                IsAutomaticMigrationSupported
>>                                IsSLMappingSupported
>>                                IsSystemImageGUIDsupported
>>                                IsCommunicatonManagementSupported
>>                                IsVendorClassSupported
>>                                IsCapabilityMaskNoticeSupported
>>                                IsClientRegistrationSupported
>> DiagCode:........................0x0000
>> MkeyLeasePeriod:.................0
>> LocalPort:.......................1
>> LinkWidthEnabled:................1X or 4X
>> LinkWidthSupported:..............1X or 4X
>> LinkWidthActive:.................4X
>> LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
>> LinkState:.......................Active
>> PhysLinkState:...................LinkUp
>> LinkDownDefState:................Polling
>> ProtectBits:.....................0
>> LMC:.............................0
>> LinkSpeedActive:.................5.0 Gbps
>> LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
>> NeighborMTU:.....................2048
>> SMSL:............................0
>> VLCap:...........................VL0-7
>> InitType:........................0x00
>> VLHighLimit:.....................0
>> VLArbHighCap:....................8
>> VLArbLowCap:.....................8
>> InitReply:.......................0x00
>> MtuCap:..........................2048
>> VLStallCount:....................0
>> HoqLife:.........................31
>> OperVLs:.........................VL0-7
>> PartEnforceInb:..................0
>> PartEnforceOutb:.................0
>> FilterRawInb:....................0
>> FilterRawOutb:...................0
>> MkeyViolations:..................0
>> PkeyViolations:..................0
>> QkeyViolations:..................0
>> GuidCap:.........................32
>> ClientReregister:................0
>> McastPkeyTrapSuppressionEnabled:.0
>> SubnetTimeout:...................18
>> RespTimeVal:.....................16
>> LocalPhysErr:....................8
>> OverrunErr:......................8
>> MaxCreditHint:...................0
>> RoundTrip:.......................0
>> CapabilityMask2:.................0x0000
>> LinkSpeedExtActive:..............No Extended Speed
>> LinkSpeedExtSupported:...........0
>> LinkSpeedExtEnabled:.............0
>> ======================================================
>> 
>> 
>> The problem is with the saquery commands on other nodes.
>> In most cases the execution fails, and the node shows the same behaviour as the OpenSM node when it tries to send on SL>0. The PathRequest packet does not arrive at the node with the running OpenSM (checked with ibdump). At some point during the execution the saquery binary hangs, the kernel log indicates errors, and the only option is to reboot.
>> This is the output I see for the saquery:
>> ======================================================
>> saquery -P --src-to-dst 4:8
>> ibwarn: [2535] sa_query: umad_recv failed: attr 0x11: Connection timed out
>> 
>> Query SA failed: Connection timed out
>> ======================================================
>> (In really rare cases I get the PathRequest back and see the dump, but the saquery binary stalls afterwards, too.)
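For context, a rough sketch of the requester side of such a query (an illustration only, not saquery's actual code; the helper name, buffer size and timeout are made up): umad_recv() polls the umad file descriptor for the SA response and returns a negative errno such as -ETIMEDOUT when nothing arrives, which is what shows up above as "Connection timed out".

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <infiniband/umad.h>

/* Sketch: wait for the SA response on an already-open umad port (portid as
 * returned by umad_open_port()) and report a timeout. */
static int wait_for_sa_response(int portid, int timeout_ms)
{
        int len = 256;                                  /* room for one 256-byte MAD */
        void *umad = umad_alloc(1, umad_size() + len);  /* user_mad header + payload */
        int agent;

        if (!umad)
                return -ENOMEM;

        agent = umad_recv(portid, umad, &len, timeout_ms);
        if (agent < 0)                                  /* e.g. -ETIMEDOUT if no response came back */
                fprintf(stderr, "umad_recv failed: %s\n", strerror(-agent));

        umad_free(umad);
        return agent;
}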
>> 
>> 
>> I did some debugging with gdb again, and stepped through the saquery code.
>> When I change the SL to 0 in the addr vector of the MAD right before umad_send is called, then everything works.
>> So, the saquery on the compute nodes shows the same behaviour as the opensm with respect to the SL value for umad_send.
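To make that concrete, here is a minimal sketch of the workaround (my own illustration of what was done by hand in gdb, not a patch that was actually applied): clear the SL in the ib_user_mad address vector right before handing the response to umad_send(), using the struct layout shown in the gdb dump.

#include <infiniband/umad.h>

/* Sketch: force SL 0 on an already-built response MAD just before sending it.
 * fd, agent_id and len correspond to the values seen in the gdb session. */
static int send_with_sl0(int fd, int agent_id, void *umad, int len)
{
        struct ib_user_mad *m = umad;

        m->addr.sl = 0;   /* with sl > 0 the response never left the node in this setup */
        return umad_send(fd, agent_id, umad, len, 0 /* timeout_ms */, 3 /* retries */);
}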
>> 
>> 
>> Finally, I tried to run MinHop instead of DFSSSP and specified sm_sl 1 in the opensm config file.
>> Sadly, this configuration results in the same crashes of the saquery commands.
>> For the runs with MinHop I also used a different SL2VL mapping, just to be sure that there is no problem with VL>0; with this mapping every SL travels on VL=0:
>> ======================================================
>> #                 SL: | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
>> ports: in  0, out  0: | 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0| 0|
>> ======================================================
> 
> Non-QoS routing algorithms still need -Q, otherwise the full range of QoS
> is not available. Was OpenSM started with -Q for this test ?

Yes, I had QoS enabled in my configuration file with "qos TRUE".
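For completeness, the QoS-related part of the opensm configuration used for these tests looked roughly like this (reconstructed from the values quoted in this thread, so treat it as illustrative rather than a verbatim copy of the file):

        qos TRUE
        # default, CA and switch-external-port SL2VL mappings as quoted above
        qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
        qos_ca_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
        qos_swe_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
        # qos_sw0_sl2vl and qos_rtr_sl2vl were left unset (null)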

Jens

> 
> -- Hal
>> 
>> Regards,
>> Jens
>> 
>> 
>> On Dec 16, 2012, at 11:59 PM, Jens Domke wrote:
>> 
>>> 
>>> On Dec 16, 2012, at 10:48 PM, Hal Rosenstock wrote:
>>> 
>>>> On 12/16/2012 8:39 AM, Jens Domke wrote:
>>>>> Hi,
>>>>> 
>>>>> On Dec 16, 2012, at 9:32 PM, Hal Rosenstock wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> On 12/16/2012 7:03 AM, Jens Domke wrote:
>>>>>>> Hello Hal,
>>>>>>> 
>>>>>>> On Dec 15, 2012, at 5:44 AM, Hal Rosenstock wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> On 12/14/2012 3:32 PM, Jens Domke wrote:
>>>>>>>>> Hello Hal,
>>>>>>>>> 
>>>>>>>>> On Dec 15, 2012, at 3:58 AM, Hal Rosenstock wrote:
>>>>>>>>> 
>>>>>>>>>> Hi,
>>>>>>>>>> 
>>>>>>>>>> On 12/14/2012 1:24 PM, Jens Domke wrote:
>>>>>>>>>>> Hello Hal,
>>>>>>>>>>> 
>>>>>>>>>>> On Dec 15, 2012, at 1:42 AM, Hal Rosenstock wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> Hi again,
>>>>>>>>>>>> 
>>>>>>>>>>>> On 12/14/2012 10:17 AM, Jens Domke wrote:
>>>>>>>>>>>>> Hello Hal,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> thank you for the fast response. I will try to clarify some points.
>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> d) OpenMPI runs are executed with "--mca btl_openib_ib_path_record_service_level 1"
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I'm not familiar with what DFSSSP does to figure out SLs exactly but
>>>>>>>>>>>>>> there should be no need to set this. The proper SL for querying the SA
>>>>>>>>>>>>>> for PathRecords, etc. is always in PortInfo.SMSL. In the case of DFSSSP
>>>>>>>>>>>>>> (and other QoS based routing algorithms), it calculates that and the SM
>>>>>>>>>>>>>> pushes this into each port. That should be used. It's possible that SL1
>>>>>>>>>>>>>> is not a valid SL for port <-> SA querying using DFSSSP.
>>>>>>>>>>>>> The OpenMPI parameter btl_openib_ib_path_record_service_level does not specify the SL for querying the PathRecords.
>>>>>>>>>>>>> It just enables the functionality. And the ompi processes use the PortInfo.SMSL to send the request.
>>>>>>>>>>>>> For the request "port -> SA" every 0<=SL<=7 was used in the test, and the SA received the requests.  
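As a side note, the SMSL a requester is supposed to use can be read back through libibverbs; below is a minimal sketch, with device and port selection hard-coded to the first HCA and port 1 (an illustration, not OMPI's actual code).

#include <stdio.h>
#include <infiniband/verbs.h>

/* Sketch: print the SM LID and SMSL (PortInfo.SMSL) that the SM programmed
 * into the local port; this is the SL a requester would use towards the SA. */
int main(void)
{
        int num = 0;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx;
        struct ibv_port_attr pattr;

        if (!devs || num == 0)
                return 1;
        ctx = ibv_open_device(devs[0]);         /* first HCA, port 1: adjust as needed */
        if (ctx && ibv_query_port(ctx, 1, &pattr) == 0)
                printf("SMLID %u  SMSL %u\n", pattr.sm_lid, pattr.sm_sl);
        if (ctx)
                ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
}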
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> e) kernel 2.6.32-220.13.1.el6.x86_64
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> As far as I understand the whole system:
>>>>>>>>>>>>>>> 1. the OMPI processes are sending MAD requests (SubnAdmGet:PathRecord) to the OpenSM
>>>>>>>>>>>>>>> 2. the SA receives the request on QP1
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> There is the SL in the query itself. This should be the SMSL that the SM
>>>>>>>>>>>>>> set for that port.
>>>>>>>>>>>>> Hmm, there you might have a point. I think I saw that the query itself had SL=0 specified.
>>>>>>>>>>>>> In fact OpenMPI sets everything to 0 except for slid and dlid.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 3. SA asks the routing algorithm (like LASH, DFSSSP or Torus_2QoS) about a special service level for the slid/dlid path
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> This is a (potentially) different SL (for MPI<->MPI port communication)
>>>>>>>>>>>>>> than the one the query used and is the one returned inside the
>>>>>>>>>>>>>> PathRecord attribute/data.
>>>>>>>>>>>>> Yes, it can be different, but DFSSSP sets the same SL, because the SM is running on a port which is also used for MPI comm.
>>>>>>>>>>>> 
>>>>>>>>>>>> With DFSSSP, are all SLs the same from a source port to any destination ?
>>>>>>>>>>> No, not necessarily. In general DFSSSP does not enforce SL(LID1->LID2) == SL(LID2->LID1) or SL(LID1->LID2) == SL(LID1->LID3).
>>>>>>>>>> 
>>>>>>>>>> If SL(LID1->LID2) != SL(LID2->LID1), that's not a reversible path.
>>>>>>>>> True. But I don't think that the SA asks the DFSSSP routing about the SL for the reversible path.
>>>>>>>>> So, the SA could use any SL which is a valid SL, even if the DFSSSP would recommend another SL.
>>>>>>>>> 
>>>>>>>>> I just read the IB spec and it says that "SL specified in the received packet is used as the SL in the response packet" for MAD packets.
>>>>>>>>> So, it's most likely that there is a mismatch between the way OMPI sets up the PathRequest and the way the SA builds the response packet.
>>>>>>>>> OMPI always specifies SL=0 (let's say SL_a) inside the PathRequest packet, 
>>>>>>>> 
>>>>>>>> So CompMask in the query has the SL bit on and SL is set to 0 inside the
>>>>>>>> SubnAdmGet of PathRecord ?
>>>>>>> 
>>>>>>> No, the CompMask didn't have the SL bit, and the SL was set to 0.
>>>>>> 
>>>>>> That means the SL in the request is wildcarded so the SA/SM fills in a
>>>>>> valid one in the response.
>>>>> Ok.
>>>>>> 
>>>>>>> I tried to follow the path of the SL bit (IB_PR_COMPMASK_SL) and the only reference I found was in osm_sa_path_record.c
>>>>>>> The SA just treats the SL in the PathRequest as an "I would like to use this SL" in case the SL bit is set.
>>>>>>> But the routing engine can overwrite the requested SL before the reply is sent.
>>>>>>> 
>>>>>>> Nevertheless, I have changed the code of OMPI so that it sets the SL bit in the CompMask and sets the SL to SMSL for the PathRequest, so that SL_a == SL_b.
>>>>>>> Sadly, the reply sent by the SA does not leave the node (for SL_b>0). Only if I change the SL to 0 in the MAD right before umad_send is called by the SA does the packet leave the node and reach the OMPI process.
>>>>>> 
>>>>>> Are you sure the response doesn't leave the SA node or it's not received
>>>>>> at the requester (OMPI node) ?
>>>>> No, I'm not sure. Is there any possibility to check that? As far as I know, ibdump does not show MAD packets which leave a port; it only shows the packets when they are received on the other end.
>>>>>> 
>>>>>>> 
>>>>>>>> 
>>>>>>>>> and sends the packet on SL_b (PortInfo.SMSL).
>>>>>>>> 
>>>>>>>> Good.
>>>>>>>> 
>>>>>>>>> The SA uses p_mad_addr->addr_type.gsi.service_level, which is SL_b, for the response.
>>>>>>>>> If SL_b is not 0, then the packet can't reach the OMPI process. Right?
>>>>>>>> 
>>>>>>>> Depends. It may be that both SLs work but maybe not.
>>>>>>>> 
>>>>>>>>> If I analyse this correctly, then there are two bugs. One is in OMPI: it does not specify the SL within the PathRequest in an appropriate way (which would be an SL suggested by DFSSSP for the reversible path). And the second bug is that the SA uses the SL on which the PathRequest packet was sent, and not the SL specified within the packet.
>>>>>>>>> What do you think?
>>>>>>>> 
>>>>>>>> Yes, it might be better to wildcard the SL in the query. The only
>>>>>>>> scenario that would fail with the query you are making is if there's no SL
>>>>>>>> 0 path between the src/dest LIDs or GIDs in the OMPI PathRecord query.
>>>>>>>> If that's the case, SA should return MAD status 0xc (status code 3 -
>>>>>>>> ERR_NO_RECORDS). But the response doesn't make it back to the requester
>>>>>>>> OMPI node so it's not even getting that far.
>>>>>>> 
>>>>>>> Yes, exactly. So, do you have an idea why the response hangs in the SA node?
>>>>>>> I have no insight into the underlying layers (kernel driver and firmware). Maybe there is something there which prevents the SA from sending MADs back on SL>0?
>>>>>> 
>>>>>> If you're sure this response doesn't get out of the SA node, please
>>>>>> contact Mellanox support with the details.
>>>>> Ok, I can do this, if it turns out to be true.
>>>>>> 
>>>>>>>> 
>>>>>>>>> I can try to change the PathRequest of OMPI tomorrow, so that it matches addr_type.gsi.service_level.
>>>>>>>>> Maybe, with this change, the packets from the SA will reach the OMPI process on an SL>0.
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 4. SA sends the PathRecord back to the OMPI process via umad_send in libvendor/osm_vendor_ibumad.c
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> By the response reversibility rule, I think this is returned on the SL
>>>>>>>>>>>>>> of the original query but haven't verified this in the code base yet.
>>>>>>>>>>>>> Ok, I was not aware of that rule. But if this is true, then the SA should also be able to send via SL>0.
>>>>>>>>>>>> 
>>>>>>>>>>>> I doubled checked and indeed the SA response does use the SL that the
>>>>>>>>>>>> incoming request was received on.
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The osm_vendor_send() function builds the MAD packet with the following attributes:
>>>>>>>>>>>>>>> /* GS classes */
>>>>>>>>>>>>>>> umad_set_addr_net(p_vw->umad, p_mad_addr->dest_lid,
>>>>>>>>>>>>>>>                   p_mad_addr->addr_type.gsi.remote_qp,
>>>>>>>>>>>>>>>                   p_mad_addr->addr_type.gsi.service_level,
>>>>>>>>>>>>>>>                   IB_QP1_WELL_KNOWN_Q_KEY);
>>>>>>>>>>>>>>> So, the SL is the same like the one which was used by the OMPI process. The Q_Key matches the Q_key on the OMPI process, and remote_qp and dest_lid is correct, too.
>>>>>>>>>>>>>>> Afterwards umad_send(…) is used to send the reply with the PathRecord, and this send does not work (except for SL=0).
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> By not working, what do you mean ? Do you mean it's not received at the
>>>>>>>>>>>>>> requester with no message in the OpenSM log or not received at the
>>>>>>>>>>>>>> OpenSM or something else ? It could be due to the wrong SL being used in
>>>>>>>>>>>>>> the original request (forcing it to SL 1). That could cause it not to be
>>>>>>>>>>>>>> received at the SM or the response not to make it back to the requester
>>>>>>>>>>>>>> from the SA if the SL used is not "reversible".
>>>>>>>>>>>>> By "not working" I mean, that the MPI process does not receive any response from the SA.
>>>>>>>>>>>>> I get messages from the MPI process like the following:
>>>>>>>>>>>>> [rc011][[14851,1],1][connect/btl_openib_connect_sl.c:301:get_pathrecord_info] No response from SA after 20 retries
>>>>>>>>>>>>> The log of OpenSM shows that the SA received the PathRequest query, dumps the query into the log, and sends the reply back.
>>>>>>>>>>>>> And I think I saw some messages in the log about "…1 outstanding MAD…".
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If I look into the MAD before it is send, then it looks like this:
>>>>>>>>>>>>>>> Breakpoint 2, umad_send (fd=9, agentid=2, umad=0x7fffe8012530, length=120, timeout_ms=0, retries=3)
>>>>>>>>>>>>>>> at src/umad.c:791
>>>>>>>>>>>>>>> 791             if (umaddebug > 1)
>>>>>>>>>>>>>>> (gdb) p *mad
>>>>>>>>>>>>>>> $1 = {agent_id = 2, status = 0, timeout_ms = 0, retries = 3, length = 0, addr = {qpn = 1325427712, qkey = 384, 
>>>>>>>>>>>>>>> lid = 4096, sl = 6 '\006', path_bits = 0 '\000', grh_present = 0 '\000', gid_index = 0 '\000', 
>>>>>>>>>>>>>>> hop_limit = 0 '\000', traffic_class = 0 '\000', gid = '\000' <repeats 15 times>, flow_label = 0, 
>>>>>>>>>>>>>>> pkey_index = 0, reserved = "\000\000\000\000\000"}, data = 0x7fffe8012530 "\002"}
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Is this the PathRecord query on the OpenMPI side or the response on the
>>>>>>>>>>>>>> OpenSM side ? SL is 6 rather than 1 here.
>>>>>>>>>>>>> This is the response on the OpenSM side (inside the umad_send function, right before it is written to the device with write(fd, …)).
>>>>>>>>>>>>> SL=6 indicates that the MPI process was sending the request on SL 6.
>>>>>>>>>>>> 
>>>>>>>>>>>> What is SMSL for the requester ? Was it SL 6 ?
>>>>>>>>>>> Yes, it was SL 6.
>>>>>>>>>>> Here is the content of a similar packet which was received by the SA. I used ibdump on the port where the OpenSM was running:
>>>>>>>>>>> ======================================================================================
>>>>>>>>>>> No.     Time        Source                Destination           Protocol Length Info
>>>>>>>>>>> 785 14.352168   LID: 384              LID: 4140             InfiniBand 290    UD Send Only SubnAdmGet(PathRecord)
>>>>>>>>>>> 
>>>>>>>>>>> Frame 785: 290 bytes on wire (2320 bits), 290 bytes captured (2320 bits)
>>>>>>>>>>> Arrival Time: Dec 13, 2012 18:09:44.437633332 JST
>>>>>>>>>>> Epoch Time: 1355389784.437633332 seconds
>>>>>>>>>>> [Time delta from previous captured frame: 4.332020528 seconds]
>>>>>>>>>>> [Time delta from previous displayed frame: 4.332020528 seconds]
>>>>>>>>>>> [Time since reference or first frame: 14.352168681 seconds]
>>>>>>>>>>> Frame Number: 785
>>>>>>>>>>> Frame Length: 290 bytes (2320 bits)
>>>>>>>>>>> Capture Length: 290 bytes (2320 bits)
>>>>>>>>>>> [Frame is marked: False]
>>>>>>>>>>> [Frame is ignored: False]
>>>>>>>>>>> [Protocols in frame: erf:infiniband]
>>>>>>>>>>> Extensible Record Format
>>>>>>>>>>> [ERF Header]
>>>>>>>>>>>   Timestamp: 0x50c99b587008bcf2
>>>>>>>>>>>   [Header type]
>>>>>>>>>>>       .001 0101 = type: INFINIBAND (21)
>>>>>>>>>>>       0... .... = Extension header present: 0
>>>>>>>>>>>   0000 0100 = flags: 4
>>>>>>>>>>>       .... ..00 = capture interface: 0
>>>>>>>>>>>       .... .1.. = varying record length: 1
>>>>>>>>>>>       .... 0... = truncated: 0
>>>>>>>>>>>       ...0 .... = rx error: 0
>>>>>>>>>>>       ..0. .... = ds error: 0
>>>>>>>>>>>       00.. .... = reserved: 0
>>>>>>>>>>>   record length: 306
>>>>>>>>>>>   loss counter: 0
>>>>>>>>>>>   wire length: 290
>>>>>>>>>>> InfiniBand
>>>>>>>>>>> Local Route Header
>>>>>>>>>>>   0110 .... = Virtual Lane: 0x06
>>>>>>>>>>>   .... 0000 = Link Version: 0
>>>>>>>>>>>   0110 .... = Service Level: 6
>>>>>>>>>>>   .... 00.. = Reserved (2 bits): 0
>>>>>>>>>>>   .... ..10 = Link Next Header: 0x02
>>>>>>>>>>>   Destination Local ID: 19
>>>>>>>>>>>   0000 0... .... .... = Reserved (5 bits): 0
>>>>>>>>>>>   .... .000 0100 1000 = Packet Length: 72
>>>>>>>>>>>   Source Local ID: 16
>>>>>>>>>>> Base Transport Header
>>>>>>>>>>>   Opcode: 100
>>>>>>>>>>>   1... .... = Solicited Event: True
>>>>>>>>>>>   .1.. .... = MigReq: True
>>>>>>>>>>>   ..00 .... = Pad Count: 0
>>>>>>>>>>>   .... 0000 = Header Version: 0
>>>>>>>>>>>   Partition Key: 65535
>>>>>>>>>>>   Reserved (8 bits): 0
>>>>>>>>>>>   Destination Queue Pair: 0x000001
>>>>>>>>>>>   0... .... = Acknowledge Request: False
>>>>>>>>>>>   .000 0000 = Reserved (7 bits): 0
>>>>>>>>>>>   Packet Sequence Number: 0
>>>>>>>>>>> DETH - Datagram Extended Transport Header
>>>>>>>>>>>   Queue Key: 2147549184
>>>>>>>>>>>   Reserved (8 bits): 0
>>>>>>>>>>>   Source Queue Pair: 0x00380050
>>>>>>>>>>> MAD Header - Common Management Datagram
>>>>>>>>>>>   Base Version: 0x01
>>>>>>>>>>>   Management Class: 0x03
>>>>>>>>>>>   Class Version: 0x02
>>>>>>>>>>>   Method: Get() (0x01)
>>>>>>>>>>>   Status: 0x0000
>>>>>>>>>>>   Class Specific: 0x0000
>>>>>>>>>>>   Transaction ID: 0x0010000f38005000
>>>>>>>>>>>   Attribute ID: 0x0035
>>>>>>>>>>>   Reserved: 0x0000
>>>>>>>>>>>   Attribute Modifier: 0x00000000
>>>>>>>>>>>   MAD Data Payload: 000000000000000000000000000000000000000000000000...
>>>>>>>>>>> Illegal RMPP Type (0)! 
>>>>>>>>>>>   RMPP Type: 0x00
>>>>>>>>>>>   RMPP Type: 0x00
>>>>>>>>>>>   0000 .... = R Resp Time: 0x00
>>>>>>>>>>>   .... 0000 = RMPP Flags: Unknown (0x00)
>>>>>>>>>>>   RMPP Status:  (Normal) (0x00)
>>>>>>>>>>>   RMPP Data 1: 0x00000000
>>>>>>>>>>>   RMPP Data 2: 0x00000000
>>>>>>>>>>> SMASubnAdmGet(PathRecord)
>>>>>>>>>>>   SM_Key (Verification Key): 0x0000000000000000
>>>>>>>>>>>   Attribute Offset: 0x0000
>>>>>>>>>>>   Reserved: 0x0000
>>>>>>>>>>>   Component Mask: 0x0000003000000000
>>>>>>>>>>>   Attribute (PathRecord)
>>>>>>>>>>>       PathRecord
>>>>>>>>>>>           DGID: :: (::)
>>>>>>>>>>>           SGID: ::0.15.0.16 (::0.15.0.16)
>>>>>>>>>>>           DLID: 0x0000
>>>>>>>>>>>           SLID: 0x0000
>>>>>>>>>>>           0... .... = RawTraffic: 0x00
>>>>>>>>>>>           .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>>>>>>>           HopLimit: 0x00
>>>>>>>>>>>           TClass: 0x00
>>>>>>>>>>>           0... .... = Reversible: 0x00
>>>>>>>>>>>           .000 0000 = NumbPath: 0x00
>>>>>>>>>>>           P_Key: 0x0000
>>>>>>>>>>>           .... .... .... 0000 = SL: 0x0000
>>>>>>>>>>>           00.. .... = MTUSelector: 0x00
>>>>>>>>>>>           ..00 0000 = MTU: 0x00
>>>>>>>>>>>           00.. .... = RateSelector: 0x00
>>>>>>>>>>>           ..00 0000 = Rate: 0x00
>>>>>>>>>>>           00.. .... = PacketLifeTimeSelector: 0x00
>>>>>>>>>>>           ..00 0000 = PacketLifeTime: 0x00
>>>>>>>>>>>           Preference: 0x00
>>>>>>>>>>> Variant CRC: 0xad4e
>>>>>>>>>>> ======================================================================================
>>>>>>>>>> 
>>>>>>>>>> And the SubnAdmGetResp(PathRecord) is not seen ? If not, it doesn't get
>>>>>>>>>> out that machine and the issue is internal to that machine. It could be
>>>>>>>>>> because of the underlying issue which hangs OpenSM when some IB program
>>>>>>>>>> tried to unregister from the MAD layer but there were outstanding work
>>>>>>>>>> completions. That's based on your original email earlier this AM.
>>>>>>>>> No, the SubnAdmGetResp does not show up, if I use ibdump on the OMPI side and the SA uses a SL>0.
>>>>>>>> 
>>>>>>>> Can ibdump be used to capture output on the SM port ?
>>>>>>> 
>>>>>>> Yes, that works quite well, despite the warning in the ibdump manual.
>>>>>>> But I have started ibdump before opensm, maybe that makes a difference, not sure.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Jens
>>>>>>> 
>>>>>>> PS: I have seen a small bug. Not sure if it's a bug in wireshark or ibdump, but the response received by the OMPI node isn't shown correctly. The PathRecord contains an offset which is either missing in the dump or is not treated correctly by wireshark. But it causes wireshark to show the PathRecord data with wrong values.
>>>>>>> Maybe you could redirect this to the developer of ibdump, so that he can check/fix it.
>>>>>> 
>>>>>> Are you referring to the fields after the SA AttributeOffset or
>>>>>> something else ?
>>>>> Yes, after the SMASubnAdmGet Attribute Offset. Here is an example:
>>>>> I get on the OMPI side:
>>>>>  SMASubnAdmGetResp(PathRecord)
>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>      Attribute Offset: 0x0008
>>>>>      Reserved: 0x0000
>>>>>      Component Mask: 0x0000803000000000
>>>>>      Attribute (PathRecord)
>>>>>          PathRecord
>>>>>              DGID: ::8:f104:399:ebb5:fe80:0 (::8:f104:399:ebb5:fe80:0)
>>>>>              SGID: ::8:f104:399:ecd5:4:8 (::8:f104:399:ecd5:4:8)
>>>>>              DLID: 0x0000
>>>>>              SLID: 0x0000
>>>>>              0... .... = RawTraffic: 0x00
>>>>>              .... 0000 1000 0000 1111 1111 = FlowLabel: 0x0080ff
>>>>>              HopLimit: 0xff
>>>>>              TClass: 0x00
>>>>>              0... .... = Reversible: 0x00
>>>>>              .000 0011 = NumbPath: 0x03
>>>>>              P_Key: 0x8486
>>>>>              .... .... .... 0000 = SL: 0x0000
>>>>>              00.. .... = MTUSelector: 0x00
>>>>>              ..00 0000 = MTU: 0x00
>>>>>              00.. .... = RateSelector: 0x00
>>>>>              ..00 0000 = Rate: 0x00
>>>>>              00.. .... = PacketLifeTimeSelector: 0x00
>>>>>              ..00 0000 = PacketLifeTime: 0x00
>>>>>              Preference: 0x00
>>>>> 
>>>>> But it should show (see the difference in SLID, DLID, SL which are now correct):
>>>>>  SMASubnAdmGetResp(PathRecord)
>>>>>      SM_Key (Verification Key): 0x0000000000000000
>>>>>      Attribute Offset: 0x0008
>>>>>      Reserved: 0x0000
>>>>>      Component Mask: 0x0000803000000000
>>>>>      Attribute (PathRecord)
>>>>>          PathRecord
>>>>>              DGID: ::8:f104:399:ebb5 (::8:f104:399:ebb5)
>>>>>              SGID: fe80::8:f104:399:ecd5 (fe80::8:f104:399:ecd5)
>>>>>              DLID: 0x0004
>>>>>              SLID: 0x0008
>>>>>              0... .... = RawTraffic: 0x00
>>>>>              .... 0000 0000 0000 0000 0000 = FlowLabel: 0x000000
>>>>>              HopLimit: 0x00
>>>>>              TClass: 0x00
>>>>>              1... .... = Reversible: 0x01
>>>>>              .000 0000 = NumbPath: 0x00
>>>>>              P_Key: 0xffff
>>>>>              .... .... .... 0011 = SL: 0x0003
>>>>>              10.. .... = MTUSelector: 0x02
>>>>>              ..00 0100 = MTU: 0x04
>>>>>              10.. .... = RateSelector: 0x02
>>>>>              ..00 0110 = Rate: 0x06
>>>>>              10.. .... = PacketLifeTimeSelector: 0x02
>>>>>              ..01 0010 = PacketLifeTime: 0x12
>>>>>              Preference: 0x00
>>>> 
>>>> 
>>>> I think everything after AttributeOffset is off by 2 bytes. DGID doesn't
>>>> look right to me (no subnet prefix fe80:: in front of GUID).
>>> 
>>> Yes, I made a small mistake with the hexeditor. I started the shift after the subnet prefix.
>>> Sorry for the confusion.
>>> 
>>> Thank you for the hint with smpquery and saquery, I will check that tomorrow.
>>> 
>>> Jens
>>> 
>>>> 
>>>> -- Hal
>>>> 
>>>>> 
>>>>> Regards,
>>>>> Jens
>>>>> 
>>>>>> 
>>>>>> -- Hal
>>>>>> 
>>>>>>>> 
>>>>>>>> -- Hal
>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> One would need to walk the SLToVLMappingTables from requester (OMPI
>>>>>>>>>>>> port) to SA and back to see whether SL6 would even have a chance of
>>>>>>>>>>>> working (not dropping) aside from whether it's really the correct SL to use.
>>>>>>>>>>> All SL2VL tables look the same. I checked the output of OpenSM.
>>>>>>>>>>> 	SL: |  0  | 1  | 2  | 3  | 4  | 5  | 6  | 7  | 8  | 9  | 10 | 11 | 12 | 13 | 14 | 15 |
>>>>>>>>>>> 	VL: | 0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |0x0 |0x1 |0x2 |0x3 |0x4 |0x5 |0x6 |0x7 |
>>>>>>>>>>> But this is also as expected, because I have set the QoS in the opensm config as follows:
>>>>>>>>>>> 	qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
>>>>>>>>>>> This was set for "default", "CA" and "Switch external ports". I have not touched the config for "Switch Port 0" and "Router ports", they remained: qos_[sw0 | rtr]_sl2vl (null)
>>>>>>>>>> 
>>>>>>>>>> That works as long as all links have (at least) 8 data VLs (VLCap 4).
>>>>>>>>> Yes, all VL_CAP show 4 in the OpenSM log file.
>>>>>>>>> 
>>>>>>>>> Regards
>>>>>>>>> Jens
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> -- Hal
>>>>>>>>>> 
>>>>>>>>>>> Regards
>>>>>>>>>>> Jens
>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> -- Hal
>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> The output of OpenMPI or OpenSM's log file don't show any useful information for this problem, even with higher debug levels.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So nothing interesting logged relative to the PathRecord queries ?
>>>>>>>>>>>>> In the OpenSM log, only that it was received, what the request looks like, and that it was sent back.
>>>>>>>>>>>>> And a few "outstanding MADs" a few lines later in the log.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> So, right now I'm stuck, and have no idea if there is an error in the kernel driver, the HCA firmware or something completely different. Or if umad_send basically does not support SL>0.
>>>>>>>>>>>>>>> A workaround for the moment is to set the SL in the umad_set_addr_net(...) call to 0.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> So SL 0 works between all nodes and SA for querying/responses. Wonder if
>>>>>>>>>>>>>> that's how SMSL is set by DFSSSP.
>>>>>>>>>>>>> No, the SMSL set by DFSSSP is different from 0, I have checked this. In our case (OpenSM running on a compute node), it sets the same SL, which is used
>>>>>>>>>>>> for MPI<->MPI traffic, to ensure deadlock freedom.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Regards
>>>>>>>>>>>>> Jens
>>>>>>>>>>>>> 
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>>>>>> --------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> 
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>>>> --------------------------------
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>>>> 
>>>>>>>>> --------------------------------
>>>>>>>>> Dipl.-Math. Jens Domke
>>>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>>>> Global Scientific Information and Computing Center
>>>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>>>> --------------------------------
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>>> 
>>>>>>> --------------------------------
>>>>>>> Dipl.-Math. Jens Domke
>>>>>>> Researcher - Tokyo Institute of Technology
>>>>>>> Satoshi MATSUOKA Laboratory
>>>>>>> Global Scientific Information and Computing Center
>>>>>>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>>>>>>> Tokyo, 152-8550, JAPAN
>>>>>>> Tel/Fax: +81-3-5734-3876
>>>>>>> E-Mail: domke.j.aa@m.titech.ac.jp
>>>>>>> --------------------------------
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>> 
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> 
>> --------------------------------
>> Dipl.-Math. Jens Domke
>> Researcher - Tokyo Institute of Technology
>> Satoshi MATSUOKA Laboratory
>> Global Scientific Information and Computing Center
>> 2-12-1-E2-7 Ookayama, Meguro-ku, 
>> Tokyo, 152-8550, JAPAN
>> Tel/Fax: +81-3-5734-3876
>> E-Mail: domke.j.aa@m.titech.ac.jp
>> --------------------------------
>> 
>> 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

--------------------------------
Dipl.-Math. Jens Domke
Researcher - Tokyo Institute of Technology
Satoshi MATSUOKA Laboratory
Global Scientific Information and Computing Center
2-12-1-E2-7 Ookayama, Meguro-ku, 
Tokyo, 152-8550, JAPAN
Tel/Fax: +81-3-5734-3876
E-Mail: domke.j.aa@m.titech.ac.jp
--------------------------------

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2012-12-18  2:26 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-12-14 12:18 umad_send with service level higher than 0 does not work Jens Domke
2012-12-14 13:47 ` Hal Rosenstock
     [not found]   ` <50CB2DF3.7020409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-14 15:17     ` Jens Domke
2012-12-14 16:42       ` Hal Rosenstock
     [not found]         ` <50CB56E9.70900-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-14 18:24           ` Jens Domke
2012-12-14 18:58             ` Hal Rosenstock
     [not found]               ` <50CB76F2.70003-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-14 20:32                 ` Jens Domke
2012-12-14 20:44                   ` Hal Rosenstock
     [not found]                     ` <50CB8F90.1030701-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-16 12:03                       ` Jens Domke
2012-12-16 12:32                         ` Hal Rosenstock
     [not found]                           ` <50CDBF61.3080100-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-16 13:25                             ` Hal Rosenstock
2012-12-16 13:39                             ` Jens Domke
2012-12-16 13:48                               ` Hal Rosenstock
     [not found]                                 ` <50CDD114.2090706-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-16 14:59                                   ` Jens Domke
2012-12-17  6:16                                     ` Jens Domke
2012-12-17 12:04                                       ` Hal Rosenstock
     [not found]                                         ` <50CF0A33.1030809-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2012-12-18  2:26                                           ` Jens Domke
2012-12-14 18:17     ` Ira Weiny
