* Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
@ 2011-12-15 23:17 Gerben Roest
       [not found] ` <4EEA8004.4060103-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Gerben Roest @ 2011-12-15 23:17 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi,

Starting opensm from OFED 1.5.1, 1.5.3.2, 1.5.4 on a Scientific Linux 5
machine, directly linked to its neighbour (a twin 1U setup) gives me no
connection but lots of errors in /var/log/opensm.log, like these:

Dec 15 22:38:35 685651 [45AFD940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
from port 0x001e8c0000b90641 (vespasianus HCA-1), sending
IB_SA_MAD_STATUS_REQ_INVALID
Dec 15 22:38:35 686174 [464FE940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
from port 0x001e8c0000c84b62 (titus HCA-1), sending
IB_SA_MAD_STATUS_REQ_INVALID

Does anyone know what happens here? Another twin node has no problems,
that one uses OFED-1.5.1.

I can send a "-V" log of opensm or any config files if you like,

thanks,

Gerben
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found] ` <4EEA8004.4060103-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
@ 2011-12-16  0:06   ` Ira Weiny
       [not found]     ` <20111215160600.ebccb033.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Ira Weiny @ 2011-12-16  0:06 UTC (permalink / raw)
  To: Gerben Roest; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Thu, 15 Dec 2011 15:17:24 -0800
Gerben Roest <g.roest-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org> wrote:

> I can send a "-V" log of opensm or any config files if you like,

Just set -D 0x7 which adds VERBOSE and send the snippet around the above errors.
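
For readers unfamiliar with the option: the argument to -D is a bitmask of log levels. The following is a minimal sketch that decodes such a mask, assuming the bit assignments listed in the opensm man page (0x01 ERROR up to 0x40 ROUTING). The table and the helper name `decode_log_mask` are illustrative, not part of opensm, and should be checked against the installed version.

```python
# Log-level bits for opensm's -D option. Assumption: these follow the
# opensm man page; verify against the version you actually run.
OSM_LOG_BITS = {
    0x01: "ERROR",
    0x02: "INFO",
    0x04: "VERBOSE",
    0x08: "DEBUG",
    0x10: "FUNCS",
    0x20: "FRAMES",
    0x40: "ROUTING",
}


def decode_log_mask(mask):
    """Return the names of the log levels enabled in a -D bitmask.

    Bits not present in OSM_LOG_BITS are ignored by this sketch.
    """
    return [name for bit, name in sorted(OSM_LOG_BITS.items()) if mask & bit]


if __name__ == "__main__":
    # 0x7 is the value suggested in this message.
    print(decode_log_mask(0x7))
```

Under these assumptions 0x7 enables ERROR, INFO and VERBOSE, which matches the "adds VERBOSE" remark above. The same bits appear as the 0x01/0x04/0x08/0x10/0x20 column in each opensm.log line.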

Ira


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]     ` <20111215160600.ebccb033.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2011-12-16  8:56       ` Gerben Roest
       [not found]         ` <4EEB07C3.90803-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Gerben Roest @ 2011-12-16  8:56 UTC (permalink / raw)
  To: Ira Weiny; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 16-12-2011 1:06, Ira Weiny wrote:
> Just set -D 0x7 which adds VERBOSE and send the snippet around the above errors.

Dec 15 23:35:05 791001 [4399A940] 0x10 -> osm_vendor_send: [
Dec 15 23:35:05 791008 [4399A940] 0x04 -> osm_vendor_send: RMPP 0 length 256
Dec 15 23:35:05 791021 [4399A940] 0x10 -> osm_vendor_put: [
Dec 15 23:35:05 791028 [4399A940] 0x08 -> osm_vendor_put: Retiring UMAD
0x3dd9290
Dec 15 23:35:05 791034 [4399A940] 0x10 -> osm_vendor_put: ]
Dec 15 23:35:05 791040 [4399A940] 0x08 -> osm_vendor_send: Completed
sending response or unsolicited p_madw = 0x3ddf5c0
Dec 15 23:35:05 791046 [4399A940] 0x10 -> osm_vendor_send: ]
Dec 15 23:35:05 791051 [4399A940] 0x10 -> osm_sa_send_error: ]
Dec 15 23:35:05 791057 [4399A940] 0x10 -> mcmr_rcv_join_mgrp: ]
Dec 15 23:35:05 791062 [4399A940] 0x10 -> osm_mcmr_rcv_process: ]
Dec 15 23:35:05 791068 [4399A940] 0x10 -> sa_mad_ctrl_disp_done_callback: [
Dec 15 23:35:05 791073 [4399A940] 0x10 -> osm_vendor_put: [
Dec 15 23:35:05 791079 [4399A940] 0x08 -> osm_vendor_put: Retiring UMAD
0x3dd7290
Dec 15 23:35:05 791084 [4399A940] 0x10 -> osm_vendor_put: ]
Dec 15 23:35:05 791090 [4399A940] 0x10 -> sa_mad_ctrl_disp_done_callback: ]
Dec 15 23:35:05 792086 [4B1A6940] 0x10 -> osm_vendor_get: [
Dec 15 23:35:05 792106 [4B1A6940] 0x08 -> osm_vendor_get: Acquiring UMAD
for p_madw = 0x3ddf5d8, size = 256
Dec 15 23:35:05 792117 [4B1A6940] 0x08 -> osm_vendor_get: Acquired UMAD
0x3dd7290, size = 256
Dec 15 23:35:05 792126 [4B1A6940] 0x10 -> osm_vendor_get: ]
Dec 15 23:35:05 792132 [4B1A6940] 0x10 -> sa_mad_ctrl_rcv_callback: [
Dec 15 23:35:05 792139 [4B1A6940] 0x08 -> sa_mad_ctrl_rcv_callback: 4 SA
MADs received
Dec 15 23:35:05 792152 [4B1A6940] 0x20 -> SA MAD dump:
                                base_ver................0x1
                                mgmt_class..............0x3
                                class_ver...............0x2
                                method..................0x2 (SubnAdmSet)
                                status..................0x0
                                resv....................0x0
                                trans_id................0x53bf6d21e
                                attr_id.................0x38
(MCMemberRecord)
                                resv1...................0x0
                                attr_mod................0x0
                                rmpp_version............0x0
                                rmpp_type...............0x0
                                rmpp_flags..............0x0
                                rmpp_status.............0x0
                                seg_num.................0x0
                                payload_len/new_win.....0x0
                                sm_key..................0x0000000000000000
                                attr_offset.............0x0
                                resv2...................0x0
                                comp_mask...............0x0000000000010083


Dec 15 23:35:05 792158 [4B1A6940] 0x10 -> sa_mad_ctrl_process: [
Dec 15 23:35:05 792165 [4B1A6940] 0x08 -> sa_mad_ctrl_process: Posting
Dispatcher message OSM_MSG_MAD_MCMEMBER_RECORD
Dec 15 23:35:05 792187 [4B1A6940] 0x10 -> sa_mad_ctrl_process: ]
Dec 15 23:35:05 792194 [4B1A6940] 0x10 -> sa_mad_ctrl_rcv_callback: ]
Dec 15 23:35:05 792204 [46B9F940] 0x10 -> osm_mcmr_rcv_process: [
Dec 15 23:35:05 792211 [46B9F940] 0x10 -> mcmr_rcv_join_mgrp: [
Dec 15 23:35:05 792216 [46B9F940] 0x08 -> mcmr_rcv_join_mgrp: Dump of
incoming record
Dec 15 23:35:05 792228 [46B9F940] 0x08 -> MCMember Record dump:
                                MGID....................ff12:401b:ffff::ffff:ffff
                                PortGid.................fe80::1e:8c00:b9:641
                                qkey....................0x0
                                mlid....................0x0
                                mtu.....................0x0
                                TClass..................0x0
                                pkey....................0xFFFF
                                rate....................0x0
                                pkt_life................0x0
                                SLFlowLabelHopLimit.....0x0
                                ScopeState..............0x1
                                ProxyJoin...............0x0
Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3
Dec 15 23:35:05 792243 [46B9F940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12: validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed from port 0x001e8c0000b90641 (vespasianus HCA-1), sending IB_SA_MAD_STATUS_REQ_INVALID
Dec 15 23:35:05 792253 [46B9F940] 0x10 -> osm_sa_send_error: [
Dec 15 23:35:05 792260 [46B9F940] 0x10 -> osm_vendor_get: [
Dec 15 23:35:05 792266 [46B9F940] 0x08 -> osm_vendor_get: Acquiring UMAD
for p_madw = 0x3dd73f8, size = 256
Dec 15 23:35:05 792273 [46B9F940] 0x08 -> osm_vendor_get: Acquired UMAD
0x3dd9290, size = 256
Dec 15 23:35:05 792279 [46B9F940] 0x10 -> osm_vendor_get: ]
Dec 15 23:35:05 792291 [46B9F940] 0x20 -> SA MAD dump:
                                base_ver................0x1
                                mgmt_class..............0x3
                                class_ver...............0x2
                                method..................0x81
(SubnAdmGetResp)
                                status..................0x200
                                resv....................0x0
                                trans_id................0x53bf6d21e
                                attr_id.................0x38
(MCMemberRecord)
                                resv1...................0x0
                                attr_mod................0x0
                                rmpp_version............0x0
                                rmpp_type...............0x0
                                rmpp_flags..............0x0
                                rmpp_status.............0x0
                                seg_num.................0x0
                                payload_len/new_win.....0x0
                                sm_key..................0x0000000000000000
                                attr_offset.............0x0
                                resv2...................0x0
                                comp_mask...............0x0000000000010083


Dec 15 23:35:05 792298 [46B9F940] 0x10 -> osm_vendor_send: [
Dec 15 23:35:05 792304 [46B9F940] 0x04 -> osm_vendor_send: RMPP 0 length 256
Dec 15 23:35:05 792318 [46B9F940] 0x10 -> osm_vendor_put: [
Dec 15 23:35:05 792325 [46B9F940] 0x08 -> osm_vendor_put: Retiring UMAD
0x3dd9290
Dec 15 23:35:05 792331 [46B9F940] 0x10 -> osm_vendor_put: ]
Dec 15 23:35:05 792337 [46B9F940] 0x08 -> osm_vendor_send: Completed
sending response or unsolicited p_madw = 0x3dd73e0
Dec 15 23:35:05 792343 [46B9F940] 0x10 -> osm_vendor_send: ]
Dec 15 23:35:05 792360 [46B9F940] 0x10 -> osm_sa_send_error: ]
Dec 15 23:35:05 792366 [46B9F940] 0x10 -> mcmr_rcv_join_mgrp: ]
Dec 15 23:35:05 792371 [46B9F940] 0x10 -> osm_mcmr_rcv_process: ]
Dec 15 23:35:05 792377 [46B9F940] 0x10 -> sa_mad_ctrl_disp_done_callback: [
Dec 15 23:35:05 792383 [46B9F940] 0x10 -> osm_vendor_put: [
Dec 15 23:35:05 792388 [46B9F940] 0x08 -> osm_vendor_put: Retiring UMAD
0x3dd7e40
Dec 15 23:35:05 792394 [46B9F940] 0x10 -> osm_vendor_put: ]
Dec 15 23:35:05 792400 [46B9F940] 0x10 -> sa_mad_ctrl_disp_done_callback: ]
Dec 15 23:35:09 759207 [4A7A5940] 0x08 -> sm_sweeper: Off schedule sweep
signalled
Dec 15 23:35:09 759229 [4A7A5940] 0x10 -> osm_state_mgr_process: [
Dec 15 23:35:09 759240 [4A7A5940] 0x08 -> osm_state_mgr_process:
Received signal OSM_SIGNAL_SWEEP in state MASTER
Dec 15 23:35:09 759249 [4A7A5940] 0x10 -> state_mgr_sweep_hop_0: [
Dec 15 23:35:09 759258 [4A7A5940] 0x04 -> state_mgr_sweep_hop_0:
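
The comp_mask and ScopeState values in the dumps above can be decoded by hand. The following is a rough sketch, assuming the MCMemberRecord component-mask bit order used in OpenSM's ib_types.h (MGID is bit 0, ProxyJoin is bit 17) and the usual packing of scope in the high nibble and join state in the low nibble. The names `MCMEMBER_COMP_BITS`, `decode_comp_mask` and `split_scope_state` are made up for this illustration.

```python
# MCMemberRecord component-mask bits, lowest bit first.
# Assumption: this order mirrors the IB_MCR_COMPMASK_* constants in
# OpenSM's ib_types.h; check it against your headers.
MCMEMBER_COMP_BITS = [
    "MGID", "PortGID", "Q_Key", "MLID", "MTUSelector", "MTU",
    "TClass", "P_Key", "RateSelector", "Rate",
    "PacketLifeTimeSelector", "PacketLifeTime", "SL", "FlowLabel",
    "HopLimit", "Scope", "JoinState", "ProxyJoin",
]


def decode_comp_mask(mask):
    """List the MCMemberRecord fields selected by a component mask."""
    return [name for i, name in enumerate(MCMEMBER_COMP_BITS) if mask & (1 << i)]


def split_scope_state(value):
    """Split the ScopeState byte into (scope, join_state)."""
    return value >> 4, value & 0x0F


if __name__ == "__main__":
    print(decode_comp_mask(0x10083))
    print(split_scope_state(0x1))
```

If these assumptions hold, the request selects only MGID, PortGID, P_Key and JoinState, and the join state is non-zero. That would rule out the "JoinState = 0" branch of ERR 1B12 and leave the validate_port_caps line as the relevant hint. It would also mean the rate 0x0 in the record dump is simply an unset field.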



thanks,

Gerben

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]         ` <4EEB07C3.90803-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
@ 2011-12-16  9:14           ` Alex Netes
  2011-12-16 10:46             ` Gerben Roest
  0 siblings, 1 reply; 11+ messages in thread
From: Alex Netes @ 2011-12-16  9:14 UTC (permalink / raw)
  To: Gerben Roest; +Cc: Ira Weiny, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Gerben,

It's complaining about the link rate:

Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3

Probably the host that is trying to join is connected via a 1x cable.
The rate is defined by the capabilities of the host that opened the group, so
you see this problem only when a host with a higher rate created the MC group.
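
To make the two numbers in that log line concrete, here is a small sketch that maps them to link rates. It assumes the static rate encodings from the InfiniBand specification as mirrored in OpenSM's ib_types.h (2 is 2.5 Gb/s, 3 is 10 Gb/s, and so on). The table is limited to the encodings I am sure of, and `explain_rate_failure` is a hypothetical helper, not an OpenSM function.

```python
import re

# InfiniBand rate encodings (encoded value -> Gb/s).
# Assumption: taken from the IB spec / ib_types.h; newer encodings omitted.
IB_RATE_GBPS = {
    2: 2.5,   # 1x SDR
    3: 10,    # 4x SDR
    4: 30,
    5: 5,
    6: 20,
    7: 40,
    8: 60,
    9: 80,
    10: 120,
}

# \s+ also matches a newline, so mail-wrapped log lines still parse.
RATE_MSG = re.compile(r"Port's\s+RATE\s+(\d+)\s+is less than\s+(\d+)")


def explain_rate_failure(text):
    """Return (port_gbps, group_gbps) for a validate_port_caps message.

    Returns None if the text holds no such message; an unknown
    encoding is reported as None in the corresponding position.
    """
    m = RATE_MSG.search(text)
    if m is None:
        return None
    port_rate, group_rate = int(m.group(1)), int(m.group(2))
    return IB_RATE_GBPS.get(port_rate), IB_RATE_GBPS.get(group_rate)


if __name__ == "__main__":
    line = "validate_port_caps: Port's RATE 2 is less than 3"
    print(explain_rate_failure(line))
```

Read this way, the port reports 2.5 Gb/s (a 1x SDR link) while the group requires 10 Gb/s (4x SDR), which fits the 1x explanation. Note that the encoded values are not ordered by speed (5 means 5 Gb/s, 3 means 10 Gb/s), so they should not be compared as plain integers.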

-- 

-- Alex

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
  2011-12-16  9:14           ` Alex Netes
@ 2011-12-16 10:46             ` Gerben Roest
       [not found]               ` <4EEB216D.2010407-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Gerben Roest @ 2011-12-16 10:46 UTC (permalink / raw)
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 16-12-2011 10:14, Alex Netes wrote:
> Hi Gerben,
> 
> It's complaining about the link rate:
> 
> Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3
> 
> Probably, the host that is trying to join is connected via 1x cable.
> The rate is defined by the capabilities of the host that opened a group, so
> you see this problem only when the host with higher rate created the MC group.

Is it possible to force them to some specified speed?

The strange thing is that both hosts show this problem when they start
opensm; both have the same errors in /var/log/opensm.log. This is what
both hosts have:

[root@titus ~]# lspci -v |grep Infini
0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
5GT/s - IB DDR / 10GigE] (rev a0)

[root@vespasianus ~]# lspci -v |grep Infini
0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
5GT/s - IB DDR / 10GigE] (rev a0)

The hosts are connected to each other's single port via one IB cable.

[root@vespasianus ~]# grep -A1 -B1 INVALID /var/log/opensm.log| tail

Dec 16 11:35:10 041359 [483D2940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
from port 0x001e8c0000c84b62 (titus HCA-1), sending
IB_SA_MAD_STATUS_REQ_INVALID
Dec 16 11:35:10 041365 [483D2940] 0x10 -> osm_sa_send_error: [
--
Dec 16 11:35:17 351591 [429C9940] 0x04 -> validate_port_caps: Port's
RATE 2 is less than 3
Dec 16 11:35:17 351598 [429C9940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
from port 0x001e8c0000b90641 (vespasianus HCA-1), sending
IB_SA_MAD_STATUS_REQ_INVALID
Dec 16 11:35:17 351604 [429C9940] 0x10 -> osm_sa_send_error: [
--
Dec 16 11:35:18 042907 [43DCB940] 0x04 -> validate_port_caps: Port's
RATE 2 is less than 3
Dec 16 11:35:18 042914 [43DCB940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
from port 0x001e8c0000c84b62 (titus HCA-1), sending
IB_SA_MAD_STATUS_REQ_INVALID
Dec 16 11:35:18 042920 [43DCB940] 0x10 -> osm_sa_send_error: [
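
The grep above can also be done with a short script that first rejoins the wrapped log lines and then counts the rejected joins per port. This is only a sketch: the regular expressions are modelled on the log lines quoted in this thread, and `join_wrapped` and `count_rejected_joins` are names invented here.

```python
import re
from collections import Counter

# Start of an opensm.log entry as seen in this thread:
# "Dec 16 11:35:10 041359 [483D2940] 0x01 -> ..."
ENTRY_START = re.compile(
    r"^[A-Z][a-z]{2} \d{2} \d{2}:\d{2}:\d{2} \d{6} \[[0-9A-F]+\] "
    r"0x[0-9A-Fa-f]{2} -> "
)
# Port GUID and node description inside an ERR 1B12 entry.
ERR_1B12 = re.compile(r"ERR 1B12:.*?from port (0x[0-9a-f]{16}) \(([^)]*)\)")


def join_wrapped(lines):
    """Merge continuation lines into the log entry they belong to."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line == "--":  # skip blanks and grep separators
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def count_rejected_joins(lines):
    """Count ERR 1B12 entries per (port GUID, node description)."""
    counts = Counter()
    for entry in join_wrapped(lines):
        m = ERR_1B12.search(entry)
        if m:
            counts[(m.group(1), m.group(2))] += 1
    return counts


if __name__ == "__main__":
    with open("/var/log/opensm.log") as log:
        for (guid, node), n in count_rejected_joins(log).items():
            print(guid, node, n)
```

Run against the excerpt above, it would report rejected joins for both titus and vespasianus, which is the symmetry described in this message.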

Gerben


>>
>>
>> Dec 15 23:35:05 792298 [46B9F940] 0x10 -> osm_vendor_send: [
>> Dec 15 23:35:05 792304 [46B9F940] 0x04 -> osm_vendor_send: RMPP 0 length 256
>> Dec 15 23:35:05 792318 [46B9F940] 0x10 -> osm_vendor_put: [
>> Dec 15 23:35:05 792325 [46B9F940] 0x08 -> osm_vendor_put: Retiring UMAD
>> 0x3dd9290
>> Dec 15 23:35:05 792331 [46B9F940] 0x10 -> osm_vendor_put: ]
>> Dec 15 23:35:05 792337 [46B9F940] 0x08 -> osm_vendor_send: Completed
>> sending response or unsolicited p_madw = 0x3dd73e0
>> Dec 15 23:35:05 792343 [46B9F940] 0x10 -> osm_vendor_send: ]
>> Dec 15 23:35:05 792360 [46B9F940] 0x10 -> osm_sa_send_error: ]
>> Dec 15 23:35:05 792366 [46B9F940] 0x10 -> mcmr_rcv_join_mgrp: ]
>> Dec 15 23:35:05 792371 [46B9F940] 0x10 -> osm_mcmr_rcv_process: ]
>> Dec 15 23:35:05 792377 [46B9F940] 0x10 -> sa_mad_ctrl_disp_done_callback: [
>> Dec 15 23:35:05 792383 [46B9F940] 0x10 -> osm_vendor_put: [
>> Dec 15 23:35:05 792388 [46B9F940] 0x08 -> osm_vendor_put: Retiring UMAD
>> 0x3dd7e40
>> Dec 15 23:35:05 792394 [46B9F940] 0x10 -> osm_vendor_put: ]
>> Dec 15 23:35:05 792400 [46B9F940] 0x10 -> sa_mad_ctrl_disp_done_callback: ]
>> Dec 15 23:35:09 759207 [4A7A5940] 0x08 -> sm_sweeper: Off schedule sweep
>> signalled
>> Dec 15 23:35:09 759229 [4A7A5940] 0x10 -> osm_state_mgr_process: [
>> Dec 15 23:35:09 759240 [4A7A5940] 0x08 -> osm_state_mgr_process:
>> Received signal OSM_SIGNAL_SWEEP in state MASTER
>> Dec 15 23:35:09 759249 [4A7A5940] 0x10 -> state_mgr_sweep_hop_0: [
>> Dec 15 23:35:09 759258 [4A7A5940] 0x04 -> state_mgr_sweep_hop_0:
>>
>>
>>
>> thanks,
>>
>> Gerben
> 


-- 

Grep IT                      tel: 0252-769005
Egelantier 3                 fax: 0252-769006
2211 NN Noordwijkerhout     g.roest-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org
The Netherlands

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]               ` <4EEB216D.2010407-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
@ 2011-12-16 12:30                 ` Hal Rosenstock
       [not found]                   ` <4EEB39E8.5030601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Hal Rosenstock @ 2011-12-16 12:30 UTC (permalink / raw)
  To: Gerben Roest; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/16/2011 5:46 AM, Gerben Roest wrote:
> On 16-12-2011 10:14, Alex Netes wrote:
>> Hi Gerben,
>>
>> It's complaining about the link rate:
>>
>> Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3
>>
>> Probably, the host that is trying to join is connected via 1x cable.
>> The rate is defined by the capabilities of the host that opened a group, so
>> you see this problem only when the host with higher rate created the MC group.
> 
> Is it possible to force them to some specified speed?

The easiest way to fix this is to specify rate=2 for the default
partition in the partition file, as documented in the man page under
PARTITION CONFIGURATION, as follows:

Default=0x7fff,ipoib,rate=2:ALL=full;
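
For reference, the numbers in the validate_port_caps message and in the
rate= option are encoded rate values, not Gb/s. A minimal Python sketch
of the decoding, assuming the standard InfiniBand rate encoding; the
table and helper names below are made up for illustration and are not
part of OpenSM:

```python
# Encoded InfiniBand rate values (as carried in MCMemberRecord and used
# by the rate= option of the partition file) mapped to Gb/s.
# Assumption: only the classic codes 2..10 are listed here.
IB_RATE_GBPS = {
    2: 2.5,
    3: 10.0,
    4: 30.0,
    5: 5.0,
    6: 20.0,
    7: 40.0,
    8: 60.0,
    9: 80.0,
    10: 120.0,
}


def describe_rate_check(port_code, group_code):
    """Translate a "Port's RATE X is less than Y" log line into Gb/s."""
    port = IB_RATE_GBPS[port_code]
    group = IB_RATE_GBPS[group_code]
    return f"port {port:g} Gb/s < group {group:g} Gb/s"


if __name__ == "__main__":
    # The values from the log in this thread: RATE 2 is less than 3.
    print(describe_rate_check(2, 3))
```

Read this way, "RATE 2 is less than 3" is a 2.5 Gb/s port being refused
by a group that asks for 10 Gb/s, and rate=2 lowers the group's rate to
2.5 Gb/s so that the join is accepted.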

> The strange thing is that both hosts show this problem if they start
> opensm, 

What OpenSM version is this ?

> they have the same errors in /var/log/opensm.log. This is what
> both hosts have:
> 
> [root@titus ~]# lspci -v |grep Infini
> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
> 5GT/s - IB DDR / 10GigE] (rev a0)
> 
> [root@vespasianus ~]# lspci -v |grep Infini
> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
> 5GT/s - IB DDR / 10GigE] (rev a0)

What (rate) is shown in ibstat or ibstatus for each port ?
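
If it is easier, the rate that ibstatus prints can also be read
straight from sysfs. A rough Python sketch, assuming the usual
/sys/class/infiniband/<device>/ports/<n>/rate layout and a string such
as "20 Gb/sec (4X DDR)"; parse_rate and read_port_rates are names made
up for this example:

```python
import glob
import re


def parse_rate(text):
    """Parse a sysfs rate string such as '20 Gb/sec (4X DDR)'.

    Returns (gbps, description); description is None when the
    parenthesised width/speed part is absent.
    """
    m = re.match(r"\s*([\d.]+)\s*Gb/sec(?:\s*\((.*?)\))?", text)
    if m is None:
        raise ValueError("unrecognised rate string: %r" % text)
    return float(m.group(1)), m.group(2)


def read_port_rates(pattern="/sys/class/infiniband/*/ports/*/rate"):
    """Return {path: (gbps, description)} for every port on this host."""
    rates = {}
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            rates[path] = parse_rate(f.read())
    return rates


if __name__ == "__main__":
    for path, (gbps, desc) in read_port_rates().items():
        print(path, gbps, desc)
```

Running it on both hosts and comparing the output should show whether
the two ends of the cable really came up at the same width and speed.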

> The hosts are connected to each other's single port via one IB cable.

I hope they have the same rate on both ports then.

-- Hal

> [root@vespasianus ~]# grep -A1 -B1 INVALID /var/log/opensm.log| tail
> 
> Dec 16 11:35:10 041359 [483D2940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
> validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
> from port 0x001e8c0000c84b62 (titus HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID
> Dec 16 11:35:10 041365 [483D2940] 0x10 -> osm_sa_send_error: [
> --
> Dec 16 11:35:17 351591 [429C9940] 0x04 -> validate_port_caps: Port's
> RATE 2 is less than 3
> Dec 16 11:35:17 351598 [429C9940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
> validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
> from port 0x001e8c0000b90641 (vespasianus HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID
> Dec 16 11:35:17 351604 [429C9940] 0x10 -> osm_sa_send_error: [
> --
> Dec 16 11:35:18 042907 [43DCB940] 0x04 -> validate_port_caps: Port's
> RATE 2 is less than 3
> Dec 16 11:35:18 042914 [43DCB940] 0x01 -> mcmr_rcv_join_mgrp: ERR 1B12:
> validate_more_comp_fields, validate_port_caps, or JoinState = 0 failed
> from port 0x001e8c0000c84b62 (titus HCA-1), sending
> IB_SA_MAD_STATUS_REQ_INVALID
> Dec 16 11:35:18 042920 [43DCB940] 0x10 -> osm_sa_send_error: [
> 
> Gerben
> 
> 


* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]                   ` <4EEB39E8.5030601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-16 12:55                     ` Gerben Roest
       [not found]                       ` <4EEB3FD3.3080409-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Gerben Roest @ 2011-12-16 12:55 UTC (permalink / raw)
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Alex, Hal,

On 16-12-2011 13:30, Hal Rosenstock wrote:
> On 12/16/2011 5:46 AM, Gerben Roest wrote:
>> On 16-12-2011 10:14, Alex Netes wrote:
>>> Hi Gerben,
>>>
>>> It's complaining about the link rate:
>>>
>>> Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3
>>>
>>> Probably, the host that is trying to join is connected via 1x cable.
>>> The rate is defined by the capabilities of the host that opened a group, so
>>> you see this problem only when the host with higher rate created the MC group.
>>
>> Is it possible to force them to some specified speed?
> 
> The easiest way to fix this is to specify rate=2 in the partition file
> for the default partition as documented in the man page under PARTITION
> CONFIGURATION SECTION as follows:
> 
> Default=0x7fff,ipoib,rate=2:ALL=full;

This does the trick! Thanks!

> 
>> The strange thing is that both hosts show this problem if they start
>> opensm, 
> 
> What OpenSM version is this ?

opensm-3.3.9-1.x86_64

But opensm from OFED-1.5.4 gave the same error.

> 
>> they have the same errors in /var/log/opensm.log. This is what
>> both hosts have:
>>
>> [root@titus ~]# lspci -v |grep Infini
>> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
>> 5GT/s - IB DDR / 10GigE] (rev a0)
>>
>> [root@vespasianus ~]# lspci -v |grep Infini
>> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
>> 5GT/s - IB DDR / 10GigE] (rev a0)
> 
> What (rate) is shown in ibstat or ibstatus for each port ?

Both machines have one port each. Both machines give Rate=2, before and
after the opensm partitions.conf edit.

> 
>> The hosts are connected to each other's single port via one IB cable.
> 
> I hope they have the same rate on both ports then.

Yes, they did, and still do. They should be identical on-board "cards".

Could this be a cable problem? They should be DDR cards. Does Rate=2
mean DDR?

thanks,

Gerben




* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]                       ` <4EEB3FD3.3080409-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
@ 2011-12-16 13:10                         ` Hal Rosenstock
       [not found]                           ` <4EEB4362.1050505-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Hal Rosenstock @ 2011-12-16 13:10 UTC (permalink / raw)
  To: Gerben Roest; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Gerben,

On 12/16/2011 7:55 AM, Gerben Roest wrote:
> Hi Alex, Hal,
> 
> On 16-12-2011 13:30, Hal Rosenstock wrote:
>> On 12/16/2011 5:46 AM, Gerben Roest wrote:
>>> On 16-12-2011 10:14, Alex Netes wrote:
>>>> Hi Gerben,
>>>>
>>>> It's complaining about the link rate:
>>>>
>>>> Dec 15 23:35:05 792236 [46B9F940] 0x04 -> validate_port_caps: Port's RATE 2 is less than 3
>>>>
>>>> Probably, the host that is trying to join is connected via 1x cable.
>>>> The rate is defined by the capabilities of the host that opened a group, so
>>>> you see this problem only when the host with higher rate created the MC group.
>>>
>>> Is it possible to force them to some specified speed?
>>
>> The easiest way to fix this is to specify rate=2 in the partition file
>> for the default partition as documented in the man page under PARTITION
>> CONFIGURATION SECTION as follows:
>>
>> Default=0x7fff,ipoib,rate=2:ALL=full;
> 
> This does the trick! Thanks!
> 
>>
>>> The strange thing is that both hosts show this problem if they start
>>> opensm, 
>>
>> What OpenSM version is this ?
> 
> opensm-3.3.9-1.x86_64
> 
> But opensm from OFED-1.5.4 gave the same error.
> 
>>
>>> they have the same errors in /var/log/opensm.log. This is what
>>> both hosts have:
>>>
>>> [root@titus ~]# lspci -v |grep Infini
>>> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
>>> 5GT/s - IB DDR / 10GigE] (rev a0)
>>>
>>> [root@vespasianus ~]# lspci -v |grep Infini
>>> 0a:00.0 InfiniBand: Mellanox Technologies MT26418 [ConnectX VPI PCIe 2.0
>>> 5GT/s - IB DDR / 10GigE] (rev a0)
>>
>> What (rate) is shown in ibstat or ibstatus for each port ?
> 
> Both machines have one port each. Both machines give Rate=2, before and
> after the opensm partitions.conf edit.
> 
>>
>>> The hosts are connected to each other's single port via one IB cable.
>>
>> I hope they have the same rate on both ports then.
> 
> yes, they had, and have. They should be identical on-board "cards".
> 
> Could this be a cable problem? 

Yes; do you have another cable to try? If that raises the active
port rate to the full port rate (4x DDR), then you should be able to
either remove the partition config you just added (falling back to the
default rate=3) or make the group rate=6 (see below).

> They should be DDR cards. Does Rate=2 mean DDR?

No; it means 1x SDR (lowest speed/width). 4x DDR would be rate 6 (20
Gbps). See IBA 1.2.1 vol 1 PathRecord SA attribute Rate component.

By default, when the rate is not explicitly specified, OpenSM sets the
IPoIB broadcast groups to rate 3 (10 Gbps), which is 4x SDR.
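
As a rough illustration of the rate encoding described above, the sketch
below maps PathRecord Rate values to signalling rates and mimics the kind
of check that logged "Port's RATE 2 is less than 3". The table values
follow IBA 1.2.1; the function names are hypothetical, and the comparison
is a simplification for illustration, not OpenSM's actual code.

```python
# Signalling rate in Gbps for each IBA 1.2.1 PathRecord Rate encoding.
# Note the encodings are not ordered by speed: 5 (5 Gbps) is slower than 3.
IB_RATE_GBPS = {
    2: 2.5,   # 1x SDR
    3: 10,    # e.g. 4x SDR
    4: 30,
    5: 5,     # e.g. 1x DDR
    6: 20,    # e.g. 4x DDR
    7: 40,
    8: 60,
    9: 80,
    10: 120,
}


def rate_gbps(code):
    """Return the signalling rate in Gbps for a Rate encoding."""
    return IB_RATE_GBPS[code]


def port_can_join(port_rate, group_rate):
    """Hypothetical helper: a port qualifies if it is at least as fast
    as the rate the multicast group requires (compared in Gbps)."""
    return rate_gbps(port_rate) >= rate_gbps(group_rate)


if __name__ == "__main__":
    # Port came up at rate 2, group uses the default rate 3: join refused.
    print(port_can_join(2, 3))
    # With rate=2 set in the partition file, the same port qualifies.
    print(port_can_join(2, 2))
```

Comparing in Gbps rather than by raw encoding matters only for the
non-monotonic values such as 5; for the rates in this thread (2, 3, 6)
either comparison gives the same answer.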

-- Hal



* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]                           ` <4EEB4362.1050505-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-16 15:37                             ` Gerben Roest
       [not found]                               ` <4EEB65D0.8040802-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Gerben Roest @ 2011-12-16 15:37 UTC (permalink / raw)
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 16-12-2011 14:10, Hal Rosenstock wrote:

>> They should be DDR cards. Does Rate=2 mean DDR?
> 
> No; it means 1x SDR (lowest speed/width). 4x DDR would be rate 6 (20
> Gbps). See IBA 1.2.1 vol 1 PathRecord SA attribute Rate component.
> 
> By default, OpenSM sets the rate for the IPoIB broadcast groups when not
> explicitly specified is rate 3 (10 Gbps) which is 4x SDR.

I have a similar twin node that does work correctly (has DDR IB), and
ibstat there reports "Rate: 20",
whereas the two that are having problems report "Rate: 2".

Testing with openmpi osu_bw shows:

Rate=2:	max bw: 245 MB/s
Rate=20: max bw: 1970 MB/s

greetings,

Gerben

* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]                               ` <4EEB65D0.8040802-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
@ 2011-12-16 15:43                                 ` Hal Rosenstock
       [not found]                                   ` <4EEB6729.8070600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Hal Rosenstock @ 2011-12-16 15:43 UTC (permalink / raw)
  To: Gerben Roest; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 12/16/2011 10:37 AM, Gerben Roest wrote:
> On 16-12-2011 14:10, Hal Rosenstock wrote:
> 
>>> They should be DDR cards. Does Rate=2 mean DDR?
>>
>> No; it means 1x SDR (lowest speed/width). 4x DDR would be rate 6 (20
>> Gbps). See IBA 1.2.1 vol 1 PathRecord SA attribute Rate component.
>>
>> By default, OpenSM sets the rate for the IPoIB broadcast groups when not
>> explicitly specified is rate 3 (10 Gbps) which is 4x SDR.
> 
> I have a similar twin node that does work correctly (has DDR IB) and it
> says at ibstat: "Rate: 20"
> whereas the two that are having problems say "Rate: 2".
> 
> Testing with openmpi osu_bw show:
> 
> Rate=2:	max bw: 245 MB/s
> Rate=20: max bw: 1970 MB/s

Yes, that's consistent.
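
A back-of-the-envelope check of why those figures fit, assuming the
8b/10b line encoding used on SDR and DDR links (so 8 of every 10 bits on
the wire carry data), 1 MB = 10^6 bytes, and that ibstat's "Rate: 2"
stands for the 2.5 Gbps 1x SDR rate. The function name is hypothetical;
the measured values are the osu_bw numbers quoted above.

```python
def max_data_mb_per_s(signal_gbps):
    """Upper bound on payload bandwidth in MB/s for an 8b/10b link."""
    bits_per_s = signal_gbps * 1e9
    data_bits_per_s = bits_per_s * 8 / 10   # 8b/10b encoding overhead
    return data_bits_per_s / 8 / 1e6        # bits -> bytes -> MB


if __name__ == "__main__":
    # (signalling rate in Gbps, measured osu_bw maximum in MB/s)
    for signal, measured in [(2.5, 245), (20, 1970)]:
        limit = max_data_mb_per_s(signal)
        print(f"{signal} Gbps: limit {limit:.0f} MB/s, "
              f"measured {measured} MB/s ({measured / limit:.1%})")
```

Both measurements land within about 2% of the respective limit, which is
what one would expect from a link running at 1x SDR versus 4x DDR.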

Can you temporarily try the cable that is known to work (for rate 20)
between the ports that come up at rate 2 and see if they come up
properly (at rate 20) ?

-- Hal

> 
> greetings,
> 
> Gerben
> 


* Re: Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID
       [not found]                                   ` <4EEB6729.8070600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2011-12-16 15:56                                     ` Gerben Roest
  0 siblings, 0 replies; 11+ messages in thread
From: Gerben Roest @ 2011-12-16 15:56 UTC (permalink / raw)
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 16-12-2011 16:43, Hal Rosenstock wrote:
> On 12/16/2011 10:37 AM, Gerben Roest wrote:
>> On 16-12-2011 14:10, Hal Rosenstock wrote:
>>
>>>> They should be DDR cards. Does Rate=2 mean DDR?
>>>
>>> No; it means 1x SDR (lowest speed/width). 4x DDR would be rate 6 (20
>>> Gbps). See IBA 1.2.1 vol 1 PathRecord SA attribute Rate component.
>>>
>>> By default, OpenSM sets the rate for the IPoIB broadcast groups when not
>>> explicitly specified is rate 3 (10 Gbps) which is 4x SDR.
>>
>> I have a similar twin node that does work correctly (has DDR IB) and it
>> says at ibstat: "Rate: 20"
>> whereas the two that are having problems say "Rate: 2".
>>
>> Testing with openmpi osu_bw show:
>>
>> Rate=2:	max bw: 245 MB/s
>> Rate=20: max bw: 1970 MB/s
> 
> Yes, that's consistent.
> 
> Can you temporarily try the cable that is known to work (for rate 20)
> between the ports that come up at rate 2 and see if they come up
> properly (at rate 20) ?

Yes, I'll try that but that will be next week or so. I'll get back to
you on that.

thanks for your help,

Gerben

end of thread, other threads:[~2011-12-16 15:56 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-12-15 23:17 Problems with link, opensm complains IB_SA_MAD_STATUS_REQ_INVALID Gerben Roest
     [not found] ` <4EEA8004.4060103-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
2011-12-16  0:06   ` Ira Weiny
     [not found]     ` <20111215160600.ebccb033.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-16  8:56       ` Gerben Roest
     [not found]         ` <4EEB07C3.90803-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
2011-12-16  9:14           ` Alex Netes
2011-12-16 10:46             ` Gerben Roest
     [not found]               ` <4EEB216D.2010407-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
2011-12-16 12:30                 ` Hal Rosenstock
     [not found]                   ` <4EEB39E8.5030601-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-16 12:55                     ` Gerben Roest
     [not found]                       ` <4EEB3FD3.3080409-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
2011-12-16 13:10                         ` Hal Rosenstock
     [not found]                           ` <4EEB4362.1050505-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-16 15:37                             ` Gerben Roest
     [not found]                               ` <4EEB65D0.8040802-99SnrGqf+M9mR6Xm/wNWPw@public.gmane.org>
2011-12-16 15:43                                 ` Hal Rosenstock
     [not found]                                   ` <4EEB6729.8070600-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-16 15:56                                     ` Gerben Roest
