* OpenSM 3.3.16 at 100% CPU load, "console off"
@ 2013-10-09 11:10 Sebastian Riemer
       [not found] ` <525539A4.8090700-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sebastian Riemer @ 2013-10-09 11:10 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Hal,

we've encountered an issue with OpenSM 3.3.16 when the config option
"console off" is set: the OpenSM processes run at 100% CPU load.

From strace:
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)                       = 0
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)                       = 0
poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
read(0, "", 4096)                       = 0

As far as I've seen in the code, the function osm_console() from
opensm/osm_console.c is the only function which uses poll().

Is this issue already known or perhaps already fixed?

Thanks,
Sebastian
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found] ` <525539A4.8090700-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
@ 2013-10-09 13:28   ` Hal Rosenstock
       [not found]     ` <52555A04.8080606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Hal Rosenstock @ 2013-10-09 13:28 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Sebastian,

On 10/9/2013 7:10 AM, Sebastian Riemer wrote:
> Hi Hal,
> 
> we've encountered an issue with OpenSM 3.3.16 and the config option
> "console off".
> OpenSM processes are at 100% CPU load.
> 
> From strace:
> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> read(0, "", 4096)                       = 0
> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> read(0, "", 4096)                       = 0
> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> read(0, "", 4096)                       = 0

So poll() doesn't block for the 1 second timeout, and that's why the CPU
is at 100%?

> As far as I've seen in the code, the function osm_console() from
> opensm/osm_console.c is the only function which uses poll().

osm_vendor_ibumad has a receiver thread polling umad under the covers of
umad_recv, but I think that uses an infinite timeout rather than 1 second.

> Is this issue already known or perhaps already fixed?

This area of the console code has not changed in quite a while.
Any idea if this works with older versions of OpenSM?

This is the first I've heard of this issue.

-- Hal

> Thanks,
> Sebastian
> 

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]     ` <52555A04.8080606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-10-09 13:30       ` David Dillow
       [not found]         ` <1381325445.27365.4.camel-a7a0dvSY7KqLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
  2013-10-09 14:00       ` Hal Rosenstock
  1 sibling, 1 reply; 9+ messages in thread
From: David Dillow @ 2013-10-09 13:30 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: Sebastian Riemer, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Wed, 2013-10-09 at 09:28 -0400, Hal Rosenstock wrote:
> > From strace:
> > poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> > read(0, "", 4096)                       = 0
> > poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> > read(0, "", 4096)                       = 0
> > poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
> > read(0, "", 4096)                       = 0
> 
> So this doesn't block for 1 second and that's why the CPU is 100% ?

Looks like it is spinning on a closed socket (or stdin) -- calling
poll() on such a descriptor will return immediately...
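
To illustrate the behaviour, here is a minimal sketch in C. It assumes
that stdin of the daemonized process behaves like /dev/null, which the
thread does not confirm; count_eof_wakeups() is a made-up helper and not
part of OpenSM. It repeats the poll()/read() pair from the strace output
and counts how often poll() reports POLLIN while read() returns 0.

```c
#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

/*
 * Illustrative helper (not OpenSM code): poll `path` for input `rounds`
 * times and count how often poll() reports POLLIN while the following
 * read() returns 0, i.e. the descriptor is "readable" but at end of file.
 * Returns -1 if the file cannot be opened.
 */
int count_eof_wakeups(const char *path, int rounds, int timeout_ms)
{
	char buf[4096];
	struct pollfd pfd;
	int i, hits = 0;

	pfd.fd = open(path, O_RDONLY);
	if (pfd.fd < 0)
		return -1;
	pfd.events = POLLIN;

	for (i = 0; i < rounds; i++) {
		pfd.revents = 0;
		/* Same call pattern as in the strace output: one fd, timeout in ms. */
		if (poll(&pfd, 1, timeout_ms) == 1 &&
		    (pfd.revents & POLLIN) &&
		    read(pfd.fd, buf, sizeof(buf)) == 0)
			hits++;
	}
	close(pfd.fd);
	return hits;
}
```

On Linux, count_eof_wakeups("/dev/null", 3, 1000) should return 3
without waiting for the timeout, since the descriptor is always
reported as readable and every read() hits EOF. A loop that never
treats the 0 from read() as "stop" therefore spins.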

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]         ` <1381325445.27365.4.camel-a7a0dvSY7KqLUyTwlgNVppKKF0rrzTr+@public.gmane.org>
@ 2013-10-09 13:53           ` Sebastian Riemer
  0 siblings, 0 replies; 9+ messages in thread
From: Sebastian Riemer @ 2013-10-09 13:53 UTC (permalink / raw)
  To: David Dillow; +Cc: Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 09.10.2013 15:30, David Dillow wrote:
> On Wed, 2013-10-09 at 09:28 -0400, Hal Rosenstock wrote:
>>> From strace:
>>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>>> read(0, "", 4096)                       = 0
>>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>>> read(0, "", 4096)                       = 0
>>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>>> read(0, "", 4096)                       = 0
>>
>> So this doesn't block for 1 second and that's why the CPU is 100% ?
> 
> Looks like it is spinning on a closed socket (or stdin) -- calling
> poll() on such will return immediately...
> 

Thanks for the responses!

I've seen in the code that the local console is initialized but is not
released correctly. This should be done in osm_console_exit().

Something like this:

       if (p_oct->in_fd >= 0) {
               p_oct->in = NULL;
               p_oct->out = NULL;
               p_oct->in_fd = -1;
               p_oct->out_fd = -1;
       }

I guess what happened was that "console local" was set, then changed in
the config to "console off", and the service was restarted. Restarting
the service again didn't help.

It is strange that console_init_flag is still set. The function
osm_console() returns 0 if poll() fails. If it returned something else,
console_init_flag would be reset to 0 and, I suppose, the issue would
go away.
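
The idea of reporting a dead input descriptor to the caller could look
roughly like the sketch below. This is not the actual osm_console()
code: the enum, the function name console_poll_step() and its
parameters are hypothetical and only show how EOF can be told apart
from a plain timeout, so that the caller can stop polling.

```c
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Hypothetical result codes for one polling step. */
enum console_step {
	CONSOLE_CLOSED = -1,	/* EOF or hard error: caller should stop polling */
	CONSOLE_IDLE = 0,	/* timeout or interrupted: nothing to do */
	CONSOLE_DATA = 1	/* *nread bytes were stored in buf */
};

/* Hypothetical helper, not the OpenSM API. */
enum console_step console_poll_step(int in_fd, int timeout_ms,
				    char *buf, size_t len, ssize_t *nread)
{
	struct pollfd pfd = { .fd = in_fd, .events = POLLIN, .revents = 0 };
	int ready;
	ssize_t n;

	*nread = 0;
	ready = poll(&pfd, 1, timeout_ms);
	if (ready == 0)
		return CONSOLE_IDLE;
	if (ready < 0)
		return errno == EINTR ? CONSOLE_IDLE : CONSOLE_CLOSED;
	if (pfd.revents & POLLNVAL)
		return CONSOLE_CLOSED;	/* descriptor is not open at all */

	n = read(in_fd, buf, len);
	if (n > 0) {
		*nread = n;
		return CONSOLE_DATA;
	}
	if (n < 0 && (errno == EINTR || errno == EAGAIN))
		return CONSOLE_IDLE;
	/* n == 0 is EOF: poll() will keep reporting the fd as ready. */
	return CONSOLE_CLOSED;
}
```

A caller that receives CONSOLE_CLOSED would clear its "console is
initialized" state (console_init_flag in the discussion above) instead
of calling poll() again.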

Cheers,
Sebastian

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]     ` <52555A04.8080606-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  2013-10-09 13:30       ` David Dillow
@ 2013-10-09 14:00       ` Hal Rosenstock
       [not found]         ` <52556172.2070707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 9+ messages in thread
From: Hal Rosenstock @ 2013-10-09 14:00 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Sebastian,

On 10/9/2013 9:28 AM, Hal Rosenstock wrote:
> Hi Sebastian,
> 
> On 10/9/2013 7:10 AM, Sebastian Riemer wrote:
>> Hi Hal,
>>
>> we've encountered an issue with OpenSM 3.3.16 and the config option
>> "console off".
>> OpenSM processes are at 100% CPU load.
>>
>> From strace:
>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>> read(0, "", 4096)                       = 0
>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>> read(0, "", 4096)                       = 0
>> poll([{fd=0, events=POLLIN}], 1, 1000)  = 1 ([{fd=0, revents=POLLIN}])
>> read(0, "", 4096)                       = 0
> 
> So this doesn't block for 1 second and that's why the CPU is 100% ?
> 
>> As far as I've seen in the code, the function osm_console() from
>> opensm/osm_console.c is the only function which uses poll().
> 
> osm_vendor_ibumad has a receiver thread polling umad under the covers of
> umad_recv but I think that uses infinite rather than 1 second.
> 
>> Is this issue already known or perhaps already fixed?
> 
> This area of the console code has not changed in quite a while.
> Any idea if this works with older versions of OpenSM ?
> 
> This is the first I've heard of this issue.

Do you recall the sequence of steps that led to this?

Was the console option changed to off and then OpenSM SIGHUP'd? Something
else?

Is this reproducible?

-- Hal

> -- Hal
> 
>> Thanks,
>> Sebastian
>>
> 

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]         ` <52556172.2070707-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-10-09 14:45           ` Sebastian Riemer
       [not found]             ` <52556BEE.5070409-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sebastian Riemer @ 2013-10-09 14:45 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 09.10.2013 16:00, Hal Rosenstock wrote:

> Do you recall the sequence to get to this ?
> 
> Was console option changed to off and then OpenSM SIGHUP'd ? Something
> else ?
> 
> Is this reproducible ?

Yes, now I can reproduce it. opensm was initially started with
"console off"; I then activated "console local" and restarted the
service. CPU load went to 100% immediately. I set "console off" again,
restarted the service, and CPU load was low again.

I did this three times in a row. The third time it even remained at
100% load in the "off" state. After setting "local" and then "off"
once more, CPU load was low again.

Cheers,
Sebastian

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]             ` <52556BEE.5070409-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
@ 2013-10-09 15:15               ` Hal Rosenstock
       [not found]                 ` <525572FE.5080805-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Hal Rosenstock @ 2013-10-09 15:15 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 10/9/2013 10:45 AM, Sebastian Riemer wrote:
> On 09.10.2013 16:00, Hal Rosenstock wrote:
> 
>> Do you recall the sequence to get to this ?
>>
>> Was console option changed to off and then OpenSM SIGHUP'd ? Something
>> else ?
>>
>> Is this reproducible ?
> 
> Yes, now I can reproduce it. The opensm has been initially started with
> "console off" and I activate "console local" and restart the service.
> CPU load is at 100% immediately. I set "console off" again and restart
> the service and CPU load is low again.
> 
> I did this three times in a row, now. And the third time it even
> remained at 100% load in the "off" state. I've set "local" and "off"
> again and CPU load was low again.

What does the service restart do to OpenSM?

Note that the console parameter is _not_ changeable "on the fly" right
now so if OpenSM is being SIGHUP'd by service restart then this is a
current limitation (and is clearly not detected/protected against in the
current code base). It sounds like that may be what is going on.

-- Hal

> Cheers,
> Sebastian
> 
> 

* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]                 ` <525572FE.5080805-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-10-09 15:52                   ` Sebastian Riemer
       [not found]                     ` <52557BCF.6030302-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Sebastian Riemer @ 2013-10-09 15:52 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 09.10.2013 17:15, Hal Rosenstock wrote:
> What does service restart do in terms of OpenSM ?
> 
> Note that the console parameter is _not_ changeable "on the fly" right
> now so if OpenSM is being SIGHUP'd by service restart then this is a
> current limitation (and is clearly not detected/protected against in the
> current code base). It sounds like that may be what is going on.

Yes, it emits SIGHUP. Thanks for the information! The opensm is a
critical component. So IMHO it needs to be fixed in a way that it either
protects itself against such changes by ignoring them on the fly or it
needs to support these changes.
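
The first option (ignoring such changes on the fly) could be sketched
like this. The struct and function names are hypothetical and do not
match the real OpenSM option handling; sweep_interval only stands in
for "some option that may be changed at runtime". The point is that on
a SIGHUP reload the console setting from startup is kept and a warning
is logged, while other options are still applied.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical subset of the daemon options; not the OpenSM struct. */
struct sm_opts {
	char console[16];	/* "off", "local", ... fixed at startup */
	int sweep_interval;	/* example of a reloadable option */
};

/*
 * Apply options re-read after SIGHUP. The console setting is kept as it
 * was at startup. Returns 1 if a console change was ignored, 0 otherwise.
 */
int apply_reloaded_opts(struct sm_opts *running, const struct sm_opts *reloaded)
{
	int ignored = 0;

	if (strcmp(running->console, reloaded->console) != 0) {
		fprintf(stderr,
			"console change \"%s\" -> \"%s\" ignored, restart required\n",
			running->console, reloaded->console);
		ignored = 1;
	}
	/* Options that are safe to change at runtime are taken over. */
	running->sweep_interval = reloaded->sweep_interval;
	return ignored;
}
```

With this approach the console thread never sees a half-initialized
state, at the cost of requiring a full stop and start to switch the
console mode.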

The current situation is not really acceptable and the opensm stability
is crucial. So I'll think about fixing it.
Are you interested in patches in this regard?

Cheers,
Sebastian


* Re: OpenSM 3.3.16 at 100% CPU load, "console off"
       [not found]                     ` <52557BCF.6030302-EIkl63zCoXaH+58JC4qpiA@public.gmane.org>
@ 2013-10-09 16:51                       ` Hal Rosenstock
  0 siblings, 0 replies; 9+ messages in thread
From: Hal Rosenstock @ 2013-10-09 16:51 UTC (permalink / raw)
  To: Sebastian Riemer; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 10/9/2013 11:52 AM, Sebastian Riemer wrote:
> On 09.10.2013 17:15, Hal Rosenstock wrote:
>> What does service restart do in terms of OpenSM ?
>>
>> Note that the console parameter is _not_ changeable "on the fly" right
>> now so if OpenSM is being SIGHUP'd by service restart then this is a
>> current limitation (and is clearly not detected/protected against in the
>> current code base). It sounds like that may be what is going on.
> 
> Yes, it emits SIGHUP. Thanks for the information! The opensm is a
> critical component. So IMHO it needs to be fixed in a way that it either
> protects itself against such changes by ignoring them on the fly or it
> needs to support these changes.
> 
> The current situation is not really acceptable and the opensm stability
> is crucial. So I'll think about fixing it.
> Are you interested in patches in this regard?

Yes; such patches are always welcome! Thanks.

-- Hal

> 
> Cheers,
> Sebastian
> 
> 
> 