* Network cooling device and how to control NIC speed on thermal condition
@ 2017-04-25  8:36 Waldemar Rymarkiewicz
  2017-04-25 13:17 ` Andrew Lunn
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Waldemar Rymarkiewicz @ 2017-04-25  8:36 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi,

I am not very familiar with the Linux networking architecture, so I'd
like to ask here before I start digging into the code. Any feedback is
appreciated.

I am looking at the Linux thermal framework and at how to cool down the
system effectively when it hits a thermal condition. The existing
cooling methods cpu_cooling and clock_cooling are good. However, I want
to go further and also dynamically control switch port speed based on
the thermal condition. Lower speed means less power, and less power
means a lower temperature.

Is there any in-kernel interface to configure a switch port/NIC from another driver?

Is there any power-saving mechanism embedded in the networking stack
for when a port/interface is not really used (little or no data
traffic), or is that a task for the NIC driver itself?

I was thinking of creating a net_cooling device, similar to the
cpu_cooling device which cools down the system by scaling down the CPU
frequency. net_cooling could lower the interface speed (or tune more
parameters to achieve this). Do you think this could work from the
networking stack's perspective?
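
Roughly, I imagine the thermal side like this (only a sketch modelled on
cpu_cooling; net_cooling_apply_speed() is a made-up placeholder for
whatever the networking side would need, which is exactly what I am
asking about):

#include <linux/netdevice.h>
#include <linux/thermal.h>

struct net_cooling {
        struct net_device *ndev;
        unsigned long cur_state;        /* 0 = full speed, higher = slower */
        unsigned long max_state;
};

/* hypothetical helper: map cooling state to a link speed, e.g.
 * 0 -> 1000, 1 -> 100, 2 -> 10 Mb/s; how to actually do this from
 * another driver is the open question */
static int net_cooling_apply_speed(struct net_device *ndev,
                                   unsigned long state)
{
        return 0;       /* placeholder */
}

static int net_cooling_get_max_state(struct thermal_cooling_device *cdev,
                                     unsigned long *state)
{
        struct net_cooling *nc = cdev->devdata;

        *state = nc->max_state;
        return 0;
}

static int net_cooling_get_cur_state(struct thermal_cooling_device *cdev,
                                     unsigned long *state)
{
        struct net_cooling *nc = cdev->devdata;

        *state = nc->cur_state;
        return 0;
}

static int net_cooling_set_cur_state(struct thermal_cooling_device *cdev,
                                     unsigned long state)
{
        struct net_cooling *nc = cdev->devdata;

        if (state > nc->max_state)
                return -EINVAL;

        nc->cur_state = state;
        return net_cooling_apply_speed(nc->ndev, state);
}

static const struct thermal_cooling_device_ops net_cooling_ops = {
        .get_max_state = net_cooling_get_max_state,
        .get_cur_state = net_cooling_get_cur_state,
        .set_cur_state = net_cooling_set_cur_state,
};

/* registered from the NIC/switch driver, next to its netdev */
struct thermal_cooling_device *
net_cooling_register(struct net_device *ndev, struct net_cooling *nc)
{
        nc->ndev = ndev;
        nc->max_state = 2;
        return thermal_cooling_device_register("net_cooling", nc,
                                               &net_cooling_ops);
}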

Any pointers to the code or docs are highly appreciated.

Thanks,
/Waldek


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-25  8:36 Network cooling device and how to control NIC speed on thermal condition Waldemar Rymarkiewicz
@ 2017-04-25 13:17 ` Andrew Lunn
  2017-04-25 13:45 ` Alan Cox
  2017-04-25 16:23 ` Florian Fainelli
  2 siblings, 0 replies; 9+ messages in thread
From: Andrew Lunn @ 2017-04-25 13:17 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz; +Cc: netdev, linux-kernel

On Tue, Apr 25, 2017 at 10:36:28AM +0200, Waldemar Rymarkiewicz wrote:
> Hi,
> 
> I am not very familiar with the Linux networking architecture, so I'd
> like to ask here before I start digging into the code. Any feedback is
> appreciated.
> 
> I am looking at the Linux thermal framework and at how to cool down the
> system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I want
> to go further and also dynamically control switch port speed based on
> the thermal condition. Lower speed means less power, and less power
> means a lower temperature.
> 
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

Hi Waldemar

Linux models switch ports as network interfaces, so mostly there is
little difference between a NIC and a switch port. What you define for
one should work for the other. Mostly.

However, I don't think you need to be too worried about the NIC level
of the stack. You can mostly do this higher up in the stack. I would
expect there is a relationship between packets per second and
generated heat. You might want the NIC to give you some sort of
heating coefficient, e.g. 1 PPS is ~10 uC. Given that, you want to
throttle the PPS in the generic queuing layers. This sounds like a TC
filter: you have userspace install a TC filter, which acts as a
net_cooling device.

This, however, does not directly work for so-called 'fastpath' traffic
in switches. Frames which ingress one switch port and egress another
switch port are mostly never seen by Linux, so a software TC filter
will not affect them. However, there is infrastructure in place to
accelerate TC filters by pushing them down into the hardware, so the
same basic concept can be used for switch fastpath traffic; it just
requires a bit more work.

    Andrew


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-25  8:36 Network cooling device and how to control NIC speed on thermal condition Waldemar Rymarkiewicz
  2017-04-25 13:17 ` Andrew Lunn
@ 2017-04-25 13:45 ` Alan Cox
  2017-04-28  8:04   ` Waldemar Rymarkiewicz
  2017-04-25 16:23 ` Florian Fainelli
  2 siblings, 1 reply; 9+ messages in thread
From: Alan Cox @ 2017-04-25 13:45 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz; +Cc: netdev, linux-kernel

> I am looking at the Linux thermal framework and at how to cool down the
> system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I want
> to go further and also dynamically control switch port speed based on
> the thermal condition. Lower speed means less power, and less power
> means a lower temperature.
> 
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

No, but you can always hook that kind of functionality into the thermal
daemon. However, I'd be careful with your assumptions: lower speed also
means more time active.

https://github.com/01org/thermal_daemon

For example if you run a big encoding job on an atom instead of an Intel
i7, the atom will often not only take way longer but actually use more
total power than the i7 did.

Thus it would often be far more efficient to time synchronize your
systems, batch up data on the collecting end, have the processing node
wake up on an alarm, collect data from the other node and then actually
go back into suspend.

Modern processors are generally very good in idle states (sometimes
less so the platform around them), so trying to lower speeds may
actually be the wrong thing to do, versus, say, batching up activity
so that you handle a burst and then sleep the entire platform.

It also makes sense to keep policy like that mostly in user space,
because what you do is going to be very device-specific, e.g. things
like dimming the screen, lowering the Wi-Fi power, pausing some system
services, pausing battery charging, etc.

Now, at platform design time there are some interesting trade-offs
between 100Mbit and 1Gbit Ethernet, although less so than there used to be 8)

Alan


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-25  8:36 Network cooling device and how to control NIC speed on thermal condition Waldemar Rymarkiewicz
  2017-04-25 13:17 ` Andrew Lunn
  2017-04-25 13:45 ` Alan Cox
@ 2017-04-25 16:23 ` Florian Fainelli
  2 siblings, 0 replies; 9+ messages in thread
From: Florian Fainelli @ 2017-04-25 16:23 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz, netdev; +Cc: linux-kernel

Hello,

On 04/25/2017 01:36 AM, Waldemar Rymarkiewicz wrote:
> Hi,
> 
> I am not very familiar with the Linux networking architecture, so I'd
> like to ask here before I start digging into the code. Any feedback is
> appreciated.
> 
> I am looking at the Linux thermal framework and at how to cool down the
> system effectively when it hits a thermal condition. The existing
> cooling methods cpu_cooling and clock_cooling are good. However, I want
> to go further and also dynamically control switch port speed based on
> the thermal condition. Lower speed means less power, and less power
> means a lower temperature.
> 
> Is there any in-kernel interface to configure a switch port/NIC from another driver?

Well, there is, though mostly in the form of notifiers. For instance,
there are lots of devices doing converged FCoE/RoCE/Ethernet that have
a two-headed set of drivers: one for normal Ethernet and another for
RDMA/IB. To some extent stacked devices (VLAN, bond, team, etc.) also
call back down into their lower device, but in an abstracted way, at
the net_device level of course (layering).
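
As a very rough illustration of the notifier side, a separate module
can at least be told about other drivers' netdevs like this (sketch
only; it observes state changes, it does not reconfigure the port):

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/notifier.h>

static int net_cooling_netdev_event(struct notifier_block *nb,
                                    unsigned long event, void *ptr)
{
        struct net_device *dev = netdev_notifier_info_to_dev(ptr);

        switch (event) {
        case NETDEV_UP:
        case NETDEV_DOWN:
        case NETDEV_CHANGE:
                /* react to administrative/link state of any netdev */
                netdev_dbg(dev, "net_cooling: event %lu\n", event);
                break;
        }

        return NOTIFY_DONE;
}

static struct notifier_block net_cooling_nb = {
        .notifier_call = net_cooling_netdev_event,
};

static int __init net_cooling_notif_init(void)
{
        return register_netdevice_notifier(&net_cooling_nb);
}

static void __exit net_cooling_notif_exit(void)
{
        unregister_netdevice_notifier(&net_cooling_nb);
}

module_init(net_cooling_notif_init);
module_exit(net_cooling_notif_exit);
MODULE_LICENSE("GPL");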

> 
> Is there any power-saving mechanism embedded in the networking stack
> for when a port/interface is not really used (little or no data
> traffic), or is that a task for the NIC driver itself?

What we did (currently out of tree) in the Starfighter 2 switch driver
(drivers/net/dsa/bcm_sf2.c) is that any time a port is brought up/down
(a port = a network device) we recalculate the switch core clock, and
we also resize the buffers; that yields a little bit of power savings
here and there. I don't recall the numbers off the top of my head, but
it was significant enough that our HW designers convinced me into
doing it ;)

> 
> I was thinking of creating a net_cooling device, similar to the
> cpu_cooling device which cools down the system by scaling down the CPU
> frequency. net_cooling could lower the interface speed (or tune more
> parameters to achieve this). Do you think this could work from the
> networking stack's perspective?

This sounds like a good idea, but it could be very tricky to get right,
because even if you can somehow throttle your transmit activity (since
the host is in control), you can't do that without being disruptive to
the receive path (or not as effectively).

Unlike any kind of host-driven activity (CPU run queue, block devices,
USB, etc., and SPI, I2C and so on when not using slave-driven
interrupts), you cannot simply apply a "duty cycle" pattern where you
turn on your HW just long enough to set it up for a transfer, signal
transfer completion and go back to sleep. Networking needs to be able
to receive packets asynchronously, in a way that is usually not
predictable, although it could be for very specific workloads.

Another thing is that there is still a fair amount of energy that needs
to be spent in maintaining the link, and the HW design may be entirely
clocked based on the link speed. Depending on the HW architecture (store
and forward, cut through etc.) there would still be a cost associated
with maintaining RAMs in a state where they are operational and so on.

You could imagine writing a queuing discipline driver that throttles
transmission based on temperature sensors present in your NIC; you
could definitely do this in a way that is completely device-driver
agnostic by using the Linux thermal framework's trip point and
temperature notifications.
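
Very roughly, the device-agnostic thermal side of such a qdisc could
look like this (sketch only; the zone name and threshold are invented,
polling is shown instead of proper trip point notifications, and the
actual enqueue-time throttling is left out):

#include <linux/atomic.h>
#include <linux/err.h>
#include <linux/jiffies.h>
#include <linux/thermal.h>
#include <linux/workqueue.h>

static struct thermal_zone_device *net_cool_tz;
static atomic_t net_cool_throttle;      /* read by the (omitted) enqueue path */
static struct delayed_work net_cool_poll_work;

static void net_cool_poll(struct work_struct *work)
{
        int temp;       /* millidegrees Celsius */

        if (!thermal_zone_get_temp(net_cool_tz, &temp))
                atomic_set(&net_cool_throttle, temp > 85000);

        schedule_delayed_work(&net_cool_poll_work, msecs_to_jiffies(1000));
}

static int net_cool_thermal_init(void)
{
        /* "soc-thermal" is just an example zone name */
        net_cool_tz = thermal_zone_get_zone_by_name("soc-thermal");
        if (IS_ERR(net_cool_tz))
                return PTR_ERR(net_cool_tz);

        INIT_DELAYED_WORK(&net_cool_poll_work, net_cool_poll);
        schedule_delayed_work(&net_cool_poll_work, 0);

        return 0;
}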

For reception, if you are okay with dropping some packets, you could
implement something similar, but chances are that your NIC would still
need to receive the packets and fully process them before SW drops
them, at which point you have a myriad of options for how not to
process incoming traffic.

Hope this helps

> 
> Any pointers to the code or docs are highly appreciated.
> 
> Thanks,
> /Waldek
> 


-- 
Florian


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-25 13:45 ` Alan Cox
@ 2017-04-28  8:04   ` Waldemar Rymarkiewicz
  2017-04-28 11:56     ` Andrew Lunn
  0 siblings, 1 reply; 9+ messages in thread
From: Waldemar Rymarkiewicz @ 2017-04-28  8:04 UTC (permalink / raw)
  To: Alan Cox, Andrew Lunn, Florian Fainelli; +Cc: netdev, linux-kernel

On 25 April 2017 at 15:45, Alan Cox <gnomes@lxorguk.ukuu.org.uk> wrote:
>> I am looking at the Linux thermal framework and at how to cool down the
>> system effectively when it hits a thermal condition. The existing
>> cooling methods cpu_cooling and clock_cooling are good. However, I want
>> to go further and also dynamically control switch port speed based on
>> the thermal condition. Lower speed means less power, and less power
>> means a lower temperature.
>>
>> Is there any in-kernel interface to configure a switch port/NIC from another driver?
>
> No, but you can always hook that kind of functionality into the thermal
> daemon. However, I'd be careful with your assumptions: lower speed also
> means more time active.
>
> https://github.com/01org/thermal_daemon

This is indeed one of the options, and I will consider it as well.
However, I would prefer a generic solution in the kernel (configurable,
of course), as every network device can generate more heat at a higher
link speed.

> For example if you run a big encoding job on an atom instead of an Intel
> i7, the atom will often not only take way longer but actually use more
> total power than the i7 did.
>
> Thus it would often be far more efficient to time synchronize your
> systems, batch up data on the collecting end, have the processing node
> wake up on an alarm, collect data from the other node and then actually
> go back into suspend.

Yes, that's true under normal thermal conditions. However, if the
platform reaches the max temperature trip point, we don't really care
about performance and time efficiency; we just try to avoid the
critical trip point and a system shutdown by cooling the system, e.g.
lowering the CPU frequency, limiting the USB PHY speed, the network
link speed, etc.

I did a quick test to show you what I am after.

I collect the SoC temperature every few seconds. Meanwhile, I use
ethtool -s ethX speed <speed> to manipulate the link speed and see how
it impacts the SoC temperature. My 4 PHYs and the switch are integrated
into the SoC, and I always change the link speed for all PHYs; there is
no traffic on the link for this test. Starting at 1 Gb/s and then
scaling down to 100 Mb/s and then to 10 Mb/s, I see a significant
~10 °C drop in temperature while the link is set to 10 Mb/s.

So, throttling link speed can really help to dissipate heat
significantly when the platform is under threat.

Renegotiating link speed costs something I agree, it also impacts user
experience, but such a thermal condition will not occur often I
believe.


/Waldek


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-28  8:04   ` Waldemar Rymarkiewicz
@ 2017-04-28 11:56     ` Andrew Lunn
  2017-05-08  8:08       ` Waldemar Rymarkiewicz
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Lunn @ 2017-04-28 11:56 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel

> I collect the SoC temperature every few seconds. Meanwhile, I use
> ethtool -s ethX speed <speed> to manipulate the link speed and see how
> it impacts the SoC temperature. My 4 PHYs and the switch are integrated
> into the SoC, and I always change the link speed for all PHYs; there is
> no traffic on the link for this test. Starting at 1 Gb/s and then
> scaling down to 100 Mb/s and then to 10 Mb/s, I see a significant
> ~10 °C drop in temperature while the link is set to 10 Mb/s.

Is that a realistic test? No traffic over the network? If you are
hitting your thermal limit, to me that means one of two things:

1) The device is under very heavy load, consuming a lot of power to do
   what it needs to do.

2) Your device is idle, no packets are flowing, but your thermal
   design is wrong, so that it cannot dissipate enough heat.

It seems to me you are more interested in 1), but your quick test is
more about 2).

I would be more interested in quick tests of switching 8 Gbps, 4 Gbps,
2 Gbps, 1 Gbps, 512 Mbps, 256 Mbps, ... What effect does this have on
temperature?

> So, throttling link speed can really help to dissipate heat
> significantly when the platform is under threat.
> 
> Renegotiating link speed costs something I agree, it also impacts user
> experience, but such a thermal condition will not occur often I
> believe.

It is a heavy-handed approach, and you have to be careful. There are
some devices which don't work properly; e.g. if you try to negotiate
1000 half duplex, you might find the link just breaks.

Doing this via packet filtering, dropping packets, gives you much
finer-grained control and is a lot less disruptive. But it assumes
handling packets is what is causing your heat problems, not the links
themselves.

	Andrew


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-04-28 11:56     ` Andrew Lunn
@ 2017-05-08  8:08       ` Waldemar Rymarkiewicz
  2017-05-08 14:02         ` Andrew Lunn
  0 siblings, 1 reply; 9+ messages in thread
From: Waldemar Rymarkiewicz @ 2017-05-08  8:08 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel

On 28 April 2017 at 13:56, Andrew Lunn <andrew@lunn.ch> wrote:
> Is that a realistic test? No traffic over the network? If you are
> hitting your thermal limit, to me that means one of two things:
>
> 1) The device is under very heavy load, consuming a lot of power to do
>    what it needs to do.
>
> 2) Your device is idle, no packets are flowing, but your thermal
>    design is wrong, so that it cannot dissipate enough heat.
>
> It seems to me you are more interested in 1), but your quick test is
> more about 2).

The test was indeed not realistic, but it was rather about showing how
link speed correlates with temperature. In the test I was not under any
thermal condition, but we can achieve the same temperature gain when we
hit the hot trip point, no matter how heavy the network traffic is. The
source of heat is not necessarily heavy network traffic; there can be
several sources of heat on a SoC. However, the fact is that PHYs with
an active 1 Gb/s link generate much more heat than with a 100 Mb/s
link, independently of network traffic.


> I would be more interested in quick tests of switching 8 Gbps, 4 Gbps,
> 2 Gbps, 1 Gbps, 512 Mbps, 256 Mbps, ... What effect does this have on
> temperature?
>
>> So, throttling link speed can really help to dissipate heat
>> significantly when the platform is under threat.
>>
>> Renegotiating link speed costs something I agree, it also impacts user
>> experience, but such a thermal condition will not occur often I
>> believe.
>
> It is a heavy-handed approach, and you have to be careful. There are
> some devices which don't work properly; e.g. if you try to negotiate
> 1000 half duplex, you might find the link just breaks.

That is a valuable remark. I definitely need to run some interoperability tests.

> Doing this via packet filtering, dropping packets, gives you much
> finer-grained control and is a lot less disruptive. But it assumes
> handling packets is what is causing your heat problems, not the links
> themselves.

I consider link speed manipulation as one possible cooling method, a
way to maintain the temperature along with cpufreq, the fan, etc. The
heat is not necessarily caused by heavy network traffic itself, so
packet filtering is not of interest to me.

All other cooling methods impact the host only, but "net cooling" also
impacts the remote side, which seems to me to be a problem sometimes.
Also, the moment of link renegotiation blocks rx/tx for the upper
layers, so the user sees a pause when streaming a video, for example.
However, if a system is under a thermal condition, does it really
matter?


/Waldek


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-05-08  8:08       ` Waldemar Rymarkiewicz
@ 2017-05-08 14:02         ` Andrew Lunn
  2017-05-15 14:14           ` Waldemar Rymarkiewicz
  0 siblings, 1 reply; 9+ messages in thread
From: Andrew Lunn @ 2017-05-08 14:02 UTC (permalink / raw)
  To: Waldemar Rymarkiewicz; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel

> However, the fact is that PHYs with an active 1 Gb/s link generate much
> more heat than with a 100 Mb/s link, independently of network traffic.

Yes, this is true. I got an off-list email suggesting this power
difference is very significant, more so than actually processing
packets.

> All other cooling methods impact the host only, but "net cooling" also
> impacts the remote side, which seems to me to be a problem sometimes.
> Also, the moment of link renegotiation blocks rx/tx for the upper
> layers, so the user sees a pause when streaming a video, for example.
> However, if a system is under a thermal condition, does it really matter?

I don't know the cooling subsystem too well. Can you express a 'cost'
for making a change, as well as the likely result of making the change?
You might want to make the cost high, so it is used as a last resort if
other methods cannot give enough cooling.

       Andrew


* Re: Network cooling device and how to control NIC speed on thermal condition
  2017-05-08 14:02         ` Andrew Lunn
@ 2017-05-15 14:14           ` Waldemar Rymarkiewicz
  0 siblings, 0 replies; 9+ messages in thread
From: Waldemar Rymarkiewicz @ 2017-05-15 14:14 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Alan Cox, Florian Fainelli, netdev, linux-kernel

On 8 May 2017 at 16:02, Andrew Lunn <andrew@lunn.ch> wrote:
> Yes, this is true. I got an off-list email suggesting this power
> difference is very significant, more so than actually processing
> packets.

This is the reason I started this discussion. PHYs consume a lot of
power, so from a thermal perspective they are a good candidate for a
cooling device.

>> All other cooling methods impact the host only, but "net cooling" also
>> impacts the remote side, which seems to me to be a problem sometimes.
>> Also, the moment of link renegotiation blocks rx/tx for the upper
>> layers, so the user sees a pause when streaming a video, for example.
>> However, if a system is under a thermal condition, does it really matter?
>
> I don't know the cooling subsystem too well. Can you express a 'cost'
> for making a change, as well as the likely result of making the change?
> You might want to make the cost high, so it is used as a last resort if
> other methods cannot give enough cooling.

Because the cost is relatively high (user experience impact and the
risk that we break the link with devices that cannot handle link
renegotiation properly), it should definitely be a last-resort cooling
method before system shutdown.

The thermal framework by default shuts down the system when it reaches
the critical trip point; before that we have a hot trip point.
Normally, when you have a thermal zone defined in the system, you also
define several trip points (a struct of temperature, hysteresis and
type) and you map each trip point to a cooling device (cpu, clock,
devfreq, fan or whatever you implement). The thermal governor then
activates cooling devices based on the system temperature and the
trip<->cooling-device map to keep the system temperature at the lowest
possible level.

I also did more tests and actually implemented a prototype net_cooling
device, registered by the Ethernet driver. In my setup (a switch and
two PCs, running an iperf test and streaming video) everything works
pretty well (the link is renegotiated and the transfer continues), but
I came to the conclusion that, instead of manipulating the link speed
directly, I can modify the advertised link modes, excluding the highest
speeds, and let the PHY layer renegotiate the link. That is much safer.
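
For reference, the speed-limiting part of the prototype boils down to
something like this (simplified sketch, assuming the MAC driver uses
phylib; error handling and restoring the full set of modes when the
thermal condition clears are omitted):

#include <linux/phy.h>

static int net_cooling_limit_speed(struct phy_device *phydev,
                                   unsigned long state)
{
        /* start from what the PHY supports, then drop the fastest modes:
         * state 0 = no limit, 1 = no gigabit, 2 = 10 Mb/s only */
        u32 adv = phydev->supported;

        if (state >= 1)
                adv &= ~(SUPPORTED_1000baseT_Half | SUPPORTED_1000baseT_Full);
        if (state >= 2)
                adv &= ~(SUPPORTED_100baseT_Half | SUPPORTED_100baseT_Full);

        phydev->advertising = adv;

        /* let the PHY renegotiate with the reduced set of modes */
        return phy_start_aneg(phydev);
}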


/Waldek

