* ibstat stuck in state initialized after reboot
@ 2010-03-24 16:26 Michael Robbert
       [not found] ` <E25E098F-AFCA-4FEA-BE46-5AF59C408293-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Robbert @ 2010-03-24 16:26 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I hope this is the correct place to get help with my problem. I have an IB fabric running on a Cisco SFS switch with a 7000D as the subnet manager. The whole thing has been running great for well over a year, but today I noticed that after any node gets rebooted, its IB port never leaves the Initializing state. This has happened on 4 hosts now. What I see is as follows:

[root@compute-2-7 ~]# ibstat
CA 'mthca0'
       CA type: MT25204
       Number of ports: 1
       Firmware version: 1.2.917
       Hardware version: 20
       Node GUID: 0x0005ad00000c0990
       System image GUID: 0x0005ad000100d050
       Port 1:
               State: Initializing
               Physical state: LinkUp
               Rate: 20
               Base lid: 0
               LMC: 0
               SM lid: 0
               Capability mask: 0x02510a68
               Port GUID: 0x0005ad00000c0991
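
For anyone who wants to script this check across many nodes, here is a minimal Python sketch that parses ibstat text like the output above and flags ports whose physical link is up but which are still in Initializing with LID 0. The function names (parse_ibstat_ports, waiting_for_sm) and the three-field test are illustrative assumptions, not part of any IB tool, and the sample text is abridged from the output above.

```python
import re


def parse_ibstat_ports(text):
    """Return one dict per 'Port N:' block found in ibstat output."""
    ports = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("CA '"):
            # A new adapter starts; stop attaching fields to the last port.
            current = None
            continue
        match = re.match(r"Port (\d+):$", stripped)
        if match:
            current = {"port": int(match.group(1))}
            ports.append(current)
        elif current is not None and ":" in stripped:
            key, value = stripped.split(":", 1)
            current[key.strip()] = value.strip()
    return ports


def waiting_for_sm(port):
    """Hypothetical check: link is trained, but no LID has been assigned."""
    return (
        port.get("Physical state") == "LinkUp"
        and port.get("State") == "Initializing"
        and port.get("Base lid") == "0"
    )


# Abridged copy of the ibstat output shown above.
SAMPLE = """\
CA 'mthca0'
       CA type: MT25204
       Number of ports: 1
       Port 1:
               State: Initializing
               Physical state: LinkUp
               Rate: 20
               Base lid: 0
               LMC: 0
               SM lid: 0
               Port GUID: 0x0005ad00000c0991
"""

if __name__ == "__main__":
    for port in parse_ibstat_ports(SAMPLE):
        print(port["port"], waiting_for_sm(port))
```

In practice the text would come from running ibstat on each node; the sketch only looks at the fields that appear in the output above.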

I don't know much about subnet managers, since ours is in hardware and we've never had to configure anything on it, but I can log in to the device and it isn't showing any errors. On a node that hasn't been rebooted recently and is still working, I can see what appears to be a working subnet manager:

[root@compute-2-10 ~]# sminfo 
sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 priority 10 state 3 SMINFO_MASTER

The same command on a non-working node shows this:

[root@compute-2-7 ~]# sminfo 
sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY
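
To compare the two results programmatically, the sminfo line can be split into fields. The following is a minimal Python sketch; the regular expression is written against the two lines shown above only, and the helper names (parse_sminfo, sees_master_sm) are hypothetical.

```python
import re

# Pattern derived from the two sminfo lines shown above; other versions of
# the tool may format the line differently (assumption).
SMINFO_RE = re.compile(
    r"sminfo: sm lid (?P<lid>\d+) sm guid (?P<guid>0x[0-9a-fA-F]+), "
    r"activity count (?P<activity>\d+) priority (?P<priority>\d+) "
    r"state (?P<state>\d+) (?P<state_name>\S+)"
)


def parse_sminfo(line):
    """Split one sminfo output line into a dict of fields."""
    match = SMINFO_RE.search(line)
    if match is None:
        raise ValueError("unrecognized sminfo line: %r" % line)
    info = match.groupdict()
    for key in ("lid", "activity", "priority", "state"):
        info[key] = int(info[key])
    return info


def sees_master_sm(info):
    """Hypothetical check: the node got an answer from a master SM."""
    return info["state_name"] == "SMINFO_MASTER" and info["lid"] != 0


WORKING = ("sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count "
           "2146213408 priority 10 state 3 SMINFO_MASTER")
BROKEN = ("sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 "
          "state 2 SMINFO_STANDBY")

if __name__ == "__main__":
    for line in (WORKING, BROKEN):
        print(sees_master_sm(parse_sminfo(line)))
```

A node for which sees_master_sm is false has not learned of any master SM, which matches the SM lid of 0 in the ibstat output.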

So far I have reseated all the cables involved on both ends and I have moved the cables on the switch end to new ports and none of that has made a difference even after reboots. I am hoping to find a node that I can take offline tomorrow so I can actually test the cables, but since this seems to be happening to any host that reboots it doesn't appear to be a cabling problem. Can anybody suggest where I should go from here? Is there anything I can do from a working or non-working host to diagnose the problem? Should I try rebooting the subnet manager switch? Will that affect the rest of the fabric? 

Thanks,
Mike Robbert
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ibstat stuck in state initialized after reboot
       [not found] ` <E25E098F-AFCA-4FEA-BE46-5AF59C408293-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
@ 2010-03-24 16:38   ` Ira Weiny
       [not found]     ` <20100324093805.4c7c1034.weiny2-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Ira Weiny @ 2010-03-24 16:38 UTC
  To: Michael Robbert; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Have you spoken to Cisco about the problem?  You say you can log into the
"device" (the SM switch?).  If so, talk to Cisco about how you may be able to
restart the SM there.

It does sound like the SM on the switch is failing to transition the links.
If you can restart the SM on the switch I would try that first.  Otherwise yes
rebooting the switch is probably your best bet, and yes it will affect the
fabric, although I can't say how much without knowing the topology.
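
As background for "failing to transition the links", here is a simplified Python model of the logical port states and of who is normally responsible for the next step. The numeric values follow the PortState field of PortInfo and the names are the ones ibstat prints; the who_acts_next helper and its wording are illustrative assumptions, not output from any tool.

```python
from enum import IntEnum


class PortState(IntEnum):
    """Logical port states, numbered as in the PortInfo PortState field."""
    DOWN = 1
    INITIALIZE = 2
    ARMED = 3
    ACTIVE = 4


# Names that ibstat prints in its "State:" field.
IBSTAT_NAMES = {
    "Down": PortState.DOWN,
    "Initializing": PortState.INITIALIZE,
    "Armed": PortState.ARMED,
    "Active": PortState.ACTIVE,
}


def who_acts_next(ibstat_state):
    """Simplified view of what has to happen before the port can be used."""
    state = IBSTAT_NAMES[ibstat_state]
    if state is PortState.DOWN:
        return "physical layer: check cable, port, and link training"
    if state in (PortState.INITIALIZE, PortState.ARMED):
        return "subnet manager: must configure the port and move it to Active"
    return "nobody: the port is Active and can carry traffic"


if __name__ == "__main__":
    # The rebooted nodes in this thread report "Initializing".
    print(who_acts_next("Initializing"))
```

A port that reports LinkUp but stays in Initializing is therefore waiting on the subnet manager rather than on the cable.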

Ira

-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org

* Re: ibstat stuck in state initialized after reboot
       [not found]     ` <20100324093805.4c7c1034.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2010-03-24 16:59       ` Michael Robbert
       [not found]         ` <4256D4F9-36CC-4C21-A459-B69B363F29C9-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Robbert @ 2010-03-24 16:59 UTC
  To: Ira Weiny; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Ira,
Thanks for the quick response. That is what I was afraid of. I've been looking through the switch documentation, but it doesn't cover starting, stopping, or even checking the status of the SM service. I'll look into opening a TAC case, but since Cisco has gotten out of the IB business, I'm not looking forward to seeing what kind of product support they still have. I can tell you a little more about our topology, since it is pretty simple: all of our hosts are connected to the single large SFS switch, and the 7000D, which is our subnet manager, is plugged only into that larger switch.

Thanks for the help and wish me luck with support!

Mike


* RE: ibstat stuck in state initialized after reboot
       [not found]         ` <4256D4F9-36CC-4C21-A459-B69B363F29C9-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
@ 2010-03-24 17:12           ` Meyer, Donald J
       [not found]             ` <6203933669E90E4AB42B5BC4EDE38D350C9B6386B6-qERRe+bbXDTTXloPLtfHfbfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Meyer, Donald J @ 2010-03-24 17:12 UTC
  To: Michael Robbert, Ira Weiny; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

http://www.cisco.com/en/US/docs/server_nw_virtual/7024/release_4.1/hardware/installation/guide/7024hig.pdf

smControl
Starts and stops the embedded subnet manager.
Syntax:
smControl start | stop | restart | status

Thanks,
Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others

* Re: ibstat stuck in state initialized after reboot
       [not found]             ` <6203933669E90E4AB42B5BC4EDE38D350C9B6386B6-qERRe+bbXDTTXloPLtfHfbfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2010-03-24 17:34               ` Michael Robbert
       [not found]                 ` <230744DB-D7A7-4A1C-973E-E0D7097554DE-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  0 siblings, 1 reply; 11+ messages in thread
From: Michael Robbert @ 2010-03-24 17:34 UTC
  To: Meyer, Donald J; +Cc: Ira Weiny, linux-rdma-u79uwXL29TY76Z2rM5mHXA

Interesting note! The 7024 is our large switch where all the hosts are connected, but I was told that we were sold the 7000D because the 7024 didn't have a subnet manager. Unfortunately the 7000D has a different CLI and that command is not available and I don't have the password for our 7024 so I can't log onto it. 
On another note I just noticed the uptime on the 7000D is just over 1 day so that must have been the start of the problem, but I have no idea why it rebooted nor why it didn't come up working. I'm pretty sure we tested a reboot of the device during acceptance testing.

Oh, I just got your second note:
==================================
BTW, I highly recommend running the opensm on a server instead of using the sm on the switch.  We found running the sm on the switch was much less reliable.  I also recommend using a server dedicated to opensm only.
==================================

I will take that into consideration, but we bought this as a "turn-key" solution from Dell. They designed it and we had no experience with IB so we trusted their knowledge. 

Thanks,
Mike



* Re: ibstat stuck in state initialized after reboot
       [not found]                 ` <230744DB-D7A7-4A1C-973E-E0D7097554DE-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
@ 2010-03-24 18:25                   ` Ira Weiny
       [not found]                     ` <20100324112525.fc4a8eb9.weiny2-i2BcT+NCU+M@public.gmane.org>
  2010-03-24 18:26                   ` Michael Robbert
  1 sibling, 1 reply; 11+ messages in thread
From: Ira Weiny @ 2010-03-24 18:25 UTC
  To: Michael Robbert; +Cc: Meyer, Donald J, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 24 Mar 2010 11:34:02 -0600
Michael Robbert <mrobbert-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org> wrote:

> Interesting note! The 7024 is our large switch where all the hosts are
> connected, but I was told that we were sold the 7000D because the 7024
> didn't have a subnet manager. Unfortunately the 7000D has a different CLI
> and that command is not available and I don't have the password for our 7024
> so I can't log onto it. 
>
> On another note I just noticed the uptime on the 7000D is just over 1 day so
> that must have been the start of the problem, but I have no idea why it
> rebooted nor why it didn't come up working. I'm pretty sure we tested a
> reboot of the device during acceptance testing.
> 
> Oh, I just got your second note:
> ==================================
> BTW, I highly recommend running the opensm on a server instead of using the
> sm on the switch.  We found running the sm on the switch was much less
> reliable.  I also recommend using a server dedicated to opensm only.
> ==================================

I will second this.  OpenSM has come a long way since the time Cisco was
selling IB switches.  If I understand your situation, you don't even need the
7000D; you could just remove it and run OpenSM on a "management" node.  If you
can afford it, adding a node for OpenSM would be nice, but I am not sure you
_need_ it.

OpenSM is now managing many of the largest IB networks out there; on a
288-node system it will have no problems at all "out of the box".

:D

Ira
 
> I will take that into consideration, but we bought this as a "turn-key"
> solution from Dell. They designed it and we had no experience with IB so we
> trusted their knowledge. 

<snip>
 

* Re: ibstat stuck in state initialized after reboot
       [not found]                 ` <230744DB-D7A7-4A1C-973E-E0D7097554DE-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  2010-03-24 18:25                   ` Ira Weiny
@ 2010-03-24 18:26                   ` Michael Robbert
  1 sibling, 0 replies; 11+ messages in thread
From: Michael Robbert @ 2010-03-24 18:26 UTC
  To: Michael Robbert
  Cc: Meyer, Donald J, Ira Weiny, linux-rdma-u79uwXL29TY76Z2rM5mHXA

I just discovered another interesting point. I tried to start opensm on one of my hosts and it went into STANDBY state. Here is the log of it trying to start up:

Mar 24 12:23:25 117170 [66DAC170] 0x80 -> OpenSM 3.3.5
Entering DISCOVERING state

Mar 24 12:23:25 117863 [66DAC170] 0x02 -> osm_vendor_init: 1000 pending umads specified
Mar 24 12:23:25 118022 [66DAC170] 0x80 -> Entering DISCOVERING state
Mar 24 12:23:25 120961 [66DAC170] 0x02 -> osm_vendor_bind: Binding to port 0x5ad00000bf1e1
Mar 24 12:23:25 129023 [66DAC170] 0x02 -> osm_vendor_bind: Binding to port 0x5ad00000bf1e1
Mar 24 12:23:25 129069 [66DAC170] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x0005ad00000bf1e1
Mar 24 12:23:26 120384 [42E1E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
                        Method 0x1, Attr 0x11, TID 0xf00001a51, Hop Ptr: 0x0
Mar 24 12:23:26 120444 [42E1E940] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
Mar 24 12:23:26 120461 [42E1E940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Using default GUID 0x5ad00000bf1e1
Entering STANDBY state

Mar 24 12:23:26 120538 [42C1D940] 0x80 -> Entering STANDBY state
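
For longer logs, the error codes and state changes can be pulled out mechanically. Below is a minimal Python sketch written against the line layout visible in the log above (timestamp, a numeric field assumed to be microseconds, thread id in brackets, level, then the message). The helper names are hypothetical, and continuation lines that lack this prefix are simply skipped.

```python
import re

# Layout inferred from the log lines above; the "usec" name is an assumption.
LOG_RE = re.compile(
    r"^(?P<stamp>\w{3} +\d+ \d\d:\d\d:\d\d) (?P<usec>\d+) "
    r"\[(?P<thread>[0-9A-Fa-f]+)\] (?P<level>0x[0-9A-Fa-f]+) -> (?P<msg>.*)$"
)
ERR_RE = re.compile(r"ERR ([0-9A-Fa-f]{4}):")


def opensm_errors(log_text):
    """Return (timestamp, error code, message) for each 'ERR xxxx:' line."""
    errors = []
    for line in log_text.splitlines():
        match = LOG_RE.match(line)
        if match is None:
            continue  # stdout echo or continuation line without the prefix
        err = ERR_RE.search(match.group("msg"))
        if err:
            errors.append((match.group("stamp"), err.group(1), match.group("msg")))
    return errors


def reached_states(log_text):
    """List the SM states the log reports entering, in order."""
    return re.findall(r"-> Entering (\w+) state", log_text)


# Lines copied from the log above.
SAMPLE = """\
Mar 24 12:23:25 118022 [66DAC170] 0x80 -> Entering DISCOVERING state
Mar 24 12:23:26 120384 [42E1E940] 0x01 -> umad_receiver: ERR 5411: DR SMP Send completed with error -- dropping
Mar 24 12:23:26 120461 [42E1E940] 0x01 -> sm_mad_ctrl_send_err_cb: ERR 3113: MAD completed in error (IB_TIMEOUT): SubnGet(NodeInfo), attr_mod 0x0, TID 0x1a51
Mar 24 12:23:26 120538 [42C1D940] 0x80 -> Entering STANDBY state
"""

if __name__ == "__main__":
    for stamp, code, msg in opensm_errors(SAMPLE):
        print(stamp, code)
    print(reached_states(SAMPLE))
```

On this log the sketch reports error codes 5411 and 3113, the second being the IB_TIMEOUT on SubnGet(NodeInfo), followed by the move from DISCOVERING to STANDBY.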

Does that change the diagnosis at all? I'm still waiting for a response from tac-FYB4Gu1CFyUAvxtiuMwx3w@public.gmane.org

Thanks,
Mike

On Mar 24, 2010, at 11:34 AM, Michael Robbert wrote:

> Interesting note! The 7024 is our large switch where all the hosts are connected, but I was told that we were sold the 7000D because the 7024 didn't have a subnet manager. Unfortunately the 7000D has a different CLI and that command is not available and I don't have the password for our 7024 so I can't log onto it. 
> On another note I just noticed the uptime on the 7000D is just over 1 day so that must have been the start of the problem, but I have no idea why it rebooted nor why it didn't come up working. I'm pretty sure we tested a reboot of the device during acceptance testing.
> 
> Oh, I just got your second note:
> ==================================
> BTW, I highly recommend running the opensm on a server instead of using the sm on the switch.  We found running the sm on the switch was much less reliable.  I also recommend using a server dedicated to opensm only.
> ==================================
> 
> I will take that into consideration, but we bought this as a "turn-key" solution from Dell. They designed it and we had no experience with IB so we trusted their knowledge. 
> 
> Thanks,
> Mike
> 
> 
> On Mar 24, 2010, at 11:12 AM, Meyer, Donald J wrote:
> 
>> http://www.cisco.com/en/US/docs/server_nw_virtual/7024/release_4.1/hardware/installation/guide/7024hig.pdf
>> 
>> smControl
>> Starts and stops the embedded subnet manager.
>> Syntax:
>> smControl start | stop | restart | status
>> 
>> Thanks,
>> Don Meyer
>> Senior Network/System Engineer/Programmer
>> US+ (253) 371-9532 iNet 8-371-9532
>> *Other names and brands may be claimed as the property of others
>> -----Original Message-----
>> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org] On Behalf Of Michael Robbert
>> Sent: Wednesday, March 24, 2010 10:00 AM
>> To: Ira Weiny
>> Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> Subject: Re: ibstat stuck in state initialized after reboot
>> 
>> Ira,
>> Thanks for the quick response. That is what I was afraid of. I've been looking through the switch documentation, but it doesn't cover starting, stopping, or even checking the status of the SM service. I'll look into opening a TAC case, but since Cisco has gotten out of the IB business I'm not looking forward to seeing what kind of product support they still have. I can tell you a little more about our topology since it is pretty simple. All of our hosts are connected to the single large SFS switch, then the 7000D which is our subnet-manager is only plugged into that larger switch. 
>> 
>> Thanks for the help and wish me luck with support!
>> 
>> Mike
>> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ibstat stuck in state initialized after reboot
       [not found]                     ` <20100324112525.fc4a8eb9.weiny2-i2BcT+NCU+M@public.gmane.org>
@ 2010-03-24 19:16                       ` Chuck Hartley
  2010-03-24 19:42                       ` Michael Robbert
  1 sibling, 0 replies; 11+ messages in thread
From: Chuck Hartley @ 2010-03-24 19:16 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Michael Robbert, Meyer, Donald J, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, Mar 24, 2010 at 2:25 PM, Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org> wrote:
> On Wed, 24 Mar 2010 11:34:02 -0600
> Michael Robbert <mrobbert-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org> wrote:
>
> I will second this.  OpenSM has come a long way since the time Cisco was
> selling IB switches.  If I understand your situation you don't even need the
> 7000D you could just remove it and run OpenSM on a "management" node.  If you
> can afford it adding a node for OpenSM would be nice but I am not sure you
> _need_ it.
>
> OpenSM is now managing many of the largest IB networks out there, on a 288
> node system it will have no problems at all "out of the box".
>

Can you provide any guidelines to determine when a dedicated
management node is beneficial?

BTW, we also found that OpenSM is superior to the SM embedded in
our switches.

Chuck

* Re: ibstat stuck in state initialized after reboot
       [not found]                     ` <20100324112525.fc4a8eb9.weiny2-i2BcT+NCU+M@public.gmane.org>
  2010-03-24 19:16                       ` Chuck Hartley
@ 2010-03-24 19:42                       ` Michael Robbert
       [not found]                         ` <13A62F2E-BA5E-41AD-B020-53A8102F2738-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  1 sibling, 1 reply; 11+ messages in thread
From: Michael Robbert @ 2010-03-24 19:42 UTC (permalink / raw)
  To: Ira Weiny; +Cc: Meyer, Donald J, linux-rdma-u79uwXL29TY76Z2rM5mHXA

I've got good news. I was able to get opensm to take control. I gave it a priority of 15 and rebooted the 7000D. Unfortunately I'm not sure I can leave it like this forever. The only host I had with opensm installed is my test front end for an OS upgrade I'm testing. We're moving from Rocks 4.3 to Rocks 5.3 (RHEL 4.5 to RHEL 5.4). I may need to reboot this node from time to time over the next couple of weeks, but at least I'm working right now.
You say that a 288-node system will work "out of the box". What happens when you hit 289? Is that a magic number or just an estimate? We have 268 compute nodes plus a few auxiliary nodes, so we're pretty close to that number.
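
For readers wondering why a priority of 15 let opensm take over: InfiniBand picks the master SM by highest priority (0 to 15), breaking ties by lowest GUID. The sketch below is a simplified model of that rule only, not opensm's code; it ignores the SM state machine and handover timing. The priority of 10 and the GUID for the 7000D come from the sminfo output earlier in the thread. The `SubnetManager` class, the `elect_master` function, and the GUID used for the opensm host are hypothetical, chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SubnetManager:
    name: str
    priority: int  # valid range is 0..15; higher wins
    guid: int      # lower GUID wins when priorities are equal


def elect_master(candidates: Iterable[SubnetManager]) -> SubnetManager:
    """Return the SM that would become master under the simplified rule."""
    pool = list(candidates)
    if not pool:
        raise ValueError("no subnet manager candidates")
    for sm in pool:
        if not 0 <= sm.priority <= 15:
            raise ValueError(f"priority out of range for {sm.name}: {sm.priority}")
    # Sort key: highest priority first, then lowest GUID.
    return min(pool, key=lambda sm: (-sm.priority, sm.guid))


# Values for the 7000D are taken from the sminfo output in this thread.
embedded = SubnetManager("7000D embedded SM", priority=10, guid=0x5AD00001DF2A0)
# The GUID below is made up for illustration; the thread does not give it.
opensm = SubnetManager("opensm on test front end", priority=15, guid=0x0005AD00000C0001)

print(elect_master([embedded, opensm]).name)  # prints "opensm on test front end"
```

With both managers at the same priority, the lower GUID would decide the outcome, so an explicit higher priority is the more predictable way to prefer one SM over another.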

Thanks,
Mike

On Mar 24, 2010, at 12:25 PM, Ira Weiny wrote:

> On Wed, 24 Mar 2010 11:34:02 -0600
> Michael Robbert <mrobbert-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org> wrote:
> 
>> Interesting note! The 7024 is our large switch where all the hosts are
>> connected, but I was told that we were sold the 7000D because the 7024
>> didn't have a subnet manager. Unfortunately the 7000D has a different CLI
>> and that command is not available and I don't have the password for our 7024
>> so I can't log onto it. 
>> 
>> On another note I just noticed the uptime on the 7000D is just over 1 day so
>> that must have been the start of the problem, but I have no idea why it
>> rebooted nor why it didn't come up working. I'm pretty sure we tested a
>> reboot of the device during acceptance testing.
>> 
>> Oh, I just got your second note:
>> ==================================
>> BTW, I highly recommend running the opensm on a server instead of using the
>> sm on the switch.  We found running the sm on the switch was much less
>> reliable.  I also recommend using a server dedicated to opensm only.
>> ==================================
> 
> I will second this.  OpenSM has come a long way since the time Cisco was
> selling IB switches.  If I understand your situation you don't even need the
> 7000D you could just remove it and run OpenSM on a "management" node.  If you
> can afford it adding a node for OpenSM would be nice but I am not sure you
> _need_ it.
> 
> OpenSM is now managing many of the largest IB networks out there, on a 288
> node system it will have no problems at all "out of the box".
> 
> :D
> 
> Ira
> 
>> I will take that into consideration, but we bought this as a "turn-key"
>> solution from Dell. They designed it and we had no experience with IB so we
>> trusted their knowledge. 
> 
> <snip>
> 


* RE: ibstat stuck in state initialized after reboot
       [not found]                         ` <13A62F2E-BA5E-41AD-B020-53A8102F2738-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
@ 2010-03-24 20:29                           ` Meyer, Donald J
  2010-03-24 20:37                           ` Ira Weiny
  1 sibling, 0 replies; 11+ messages in thread
From: Meyer, Donald J @ 2010-03-24 20:29 UTC (permalink / raw)
  To: Michael Robbert, Ira Weiny; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

I can't speak for all IB networks, but I do know on our network, the SM on our switch wouldn't last longer than a week.  That network had 288 nodes.  We cured all the SM problems by switching to a dedicated server running OpenSM only.  We tried running OpenSM on a server running other tasks too, but still had occasional problems.  Currently we have over 450 nodes and no SM problems at all by using a dedicated OpenSM server.  We do match the version of OpenSM to the version of OFED we are using.
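
A failure like the ones described here can be spotted from any host by checking what sminfo reports, as was done by hand at the start of this thread. The following is a minimal sketch that parses one line of sminfo output in the format shown in this thread and reports whether a master SM is visible. The `SmInfo` type and the `parse_sminfo` and `has_master_sm` functions are hypothetical helpers; the line format is assumed from the two samples in the thread and may differ between infiniband-diags versions.

```python
import re
from typing import NamedTuple


class SmInfo(NamedTuple):
    lid: int
    guid: int
    activity: int
    priority: int
    state: int
    state_name: str


# Pattern assumed from the two sminfo samples quoted in this thread.
_PATTERN = re.compile(
    r"sminfo: sm lid (\d+) sm guid (0x[0-9a-fA-F]+), "
    r"activity count (\d+) priority (\d+) state (\d+) (\w+)"
)


def parse_sminfo(line: str) -> SmInfo:
    """Parse a single line of sminfo output into its fields."""
    match = _PATTERN.search(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sminfo output: {line!r}")
    lid, guid, activity, priority, state, state_name = match.groups()
    return SmInfo(int(lid), int(guid, 16), int(activity),
                  int(priority), int(state), state_name)


def has_master_sm(info: SmInfo) -> bool:
    """True when the port sees a master SM with a nonzero LID."""
    return info.lid != 0 and info.state_name == "SMINFO_MASTER"


working = parse_sminfo(
    "sminfo: sm lid 2 sm guid 0x5ad00001df2a0, activity count 2146213408 "
    "priority 10 state 3 SMINFO_MASTER"
)
stuck = parse_sminfo(
    "sminfo: sm lid 0 sm guid 0x0, activity count 0 priority 0 state 2 SMINFO_STANDBY"
)
print(has_master_sm(working), has_master_sm(stuck))  # prints "True False"
```

Feeding it the output of sminfo from a cron job or monitoring agent would be one way to notice a dead SM before the next node reboot exposes it.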

Thanks,
Don Meyer
Senior Network/System Engineer/Programmer
US+ (253) 371-9532 iNet 8-371-9532
*Other names and brands may be claimed as the property of others


* Re: ibstat stuck in state initialized after reboot
       [not found]                         ` <13A62F2E-BA5E-41AD-B020-53A8102F2738-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
  2010-03-24 20:29                           ` Meyer, Donald J
@ 2010-03-24 20:37                           ` Ira Weiny
  1 sibling, 0 replies; 11+ messages in thread
From: Ira Weiny @ 2010-03-24 20:37 UTC (permalink / raw)
  To: Michael Robbert; +Cc: Meyer, Donald J, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 24 Mar 2010 13:42:55 -0600
Michael Robbert <mrobbert-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org> wrote:

> I've got good news. I was able to get opensm to take control. I gave it a
> priority of 15 and rebooted the 7000D. Unfortunately I'm not sure I can
> leave it like this forever. The only host I had with opensm installed is my
> test front end for an OS upgrade I'm testing. We're moving from Rocks 4.3 to
> Rocks 5.3 (RHEL 4.5 to RHEL 5.4). I may need to reboot this node from time
> to time over the next couple of weeks, but at least I'm working right now.
>
> So you say that a 288 node system will work "out of the box", what happens
> when you hit 289? Is that a magic number or just an estimate. We have 268
> compute nodes plus a few auxiliary nodes so we're pretty close to that
> number. 

Nothing will happen when you hit 289.  I chose that number because a 7024 has
288 ports which I assumed was the size of your cluster.

There are those running large clusters (thousands of nodes) who have made some
changes to OpenSM for specialized topologies or better SA scalability.  In the
future those changes should be in OpenSM so as you grow, OpenSM grows with
you!

:-D

Ira



-- 
Ira Weiny
Math Programmer/Computer Scientist
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org

Thread overview: 11+ messages
2010-03-24 16:26 ibstat stuck in state initialized after reboot Michael Robbert
     [not found] ` <E25E098F-AFCA-4FEA-BE46-5AF59C408293-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
2010-03-24 16:38   ` Ira Weiny
     [not found]     ` <20100324093805.4c7c1034.weiny2-i2BcT+NCU+M@public.gmane.org>
2010-03-24 16:59       ` Michael Robbert
     [not found]         ` <4256D4F9-36CC-4C21-A459-B69B363F29C9-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
2010-03-24 17:12           ` Meyer, Donald J
     [not found]             ` <6203933669E90E4AB42B5BC4EDE38D350C9B6386B6-qERRe+bbXDTTXloPLtfHfbfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2010-03-24 17:34               ` Michael Robbert
     [not found]                 ` <230744DB-D7A7-4A1C-973E-E0D7097554DE-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
2010-03-24 18:25                   ` Ira Weiny
     [not found]                     ` <20100324112525.fc4a8eb9.weiny2-i2BcT+NCU+M@public.gmane.org>
2010-03-24 19:16                       ` Chuck Hartley
2010-03-24 19:42                       ` Michael Robbert
     [not found]                         ` <13A62F2E-BA5E-41AD-B020-53A8102F2738-/qOHPfZA4H6HXe+LvDLADg@public.gmane.org>
2010-03-24 20:29                           ` Meyer, Donald J
2010-03-24 20:37                           ` Ira Weiny
2010-03-24 18:26                   ` Michael Robbert
