linux-wireless.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Issues with mt7915 thermal throttling
@ 2022-01-18  9:08 Nicolas Cavallari
  2022-02-07 17:37 ` [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat Nicolas Cavallari
  0 siblings, 1 reply; 4+ messages in thread
From: Nicolas Cavallari @ 2022-01-18  9:08 UTC (permalink / raw)
  To: linux-wireless; +Cc: Ryder Lee, Lorenzo Bianconi, Felix Fietkau

I noticed some strange issues with mt7915's thermal throttling, 
particularly the cooling_device.

First, it seems that the thermal subsystem expect that higher 
cooling_device states provide more cooling, but mt7915e apparently does 
the opposite and use it as a duty cycle, where state=1 does severe 
throttling/max cooling (iperf throughput basically go down to near zero) 
and state=100 is full power.

Also, state=0, from the comments, apparently disable thermal management, 
except that in practice, it does not change the throttle state, since 
throughput stays low when switching from state=1 to state=0, and stays 
high when switching from 100 to 0.

As a result, as soon as the default thermal zone runs a little hot, the 
performance of mt7915e is destroyed and does not recover much when the 
temperature drops down.

I can come up with a patch to fix the first issue, but not the state=0 
one, and i would like some pointers/confirmation.

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat
  2022-01-18  9:08 Issues with mt7915 thermal throttling Nicolas Cavallari
@ 2022-02-07 17:37 ` Nicolas Cavallari
  2022-02-07 17:37   ` [PATCH RFC 2/3] mt76: mt7915e: Add a hwmon attribute to get the actual throttle state Nicolas Cavallari
  2022-02-07 17:37   ` [PATCH RFC 3/3] mt76: mt7915e: Enable thermal management by default Nicolas Cavallari
  0 siblings, 2 replies; 4+ messages in thread
From: Nicolas Cavallari @ 2022-02-07 17:37 UTC (permalink / raw)
  To: Felix Fietkau, Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang
  Cc: Kalle Valo, linux-wireless

mt7915e registers a cooling_device with wrong semantics:

1. cooling_device expect that higher states values should cool more, but
   mt7915e did the opposite...  with the exception of state == 0, which
   should "disable thermal management", but does not seem to have any
   effect since the previous state is kept.

The result is that when the thermal zone heats up a bit and bumps the
cooling_device state from 0 to 1 to cool a bit, the performance is
destroyed, and when going back from 1 to 0, the performance stays bad.

2. Reading the cooling_device state does not always return the last
   written state, but can return the actual hardware throttle state,
   which is different.

This is a problem because the mt7915 firmware actually implement the
equivalent of a thermal zone with trip points.  Setting the cooling
device state actually changes the throttles at each trip point, so the
following could occur if the first issue is fixed:

- thermal subsystem set state to 100% power (state=0)
- mt7915e driver set trip throttles to [100%, 50%, 25%, 12%]
- hardware heats up and decides to switch to 50% power
- thermal subsystem see that power is 50% (state=50), decide to increase
  it to 60% (state=40) because the rest of the system is cool.
- mt7915e driver set trip throttle to [60%, 30%, 15%, 7%]
- hardware thus switches to 30% power
[race to the bottom continues...]

This patch corrects the semantics of the cooling_device to the one that
the thermal subsystem expect it.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
---
 drivers/net/wireless/mediatek/mt76/mt7915/init.c | 16 ++++++++++------
 .../net/wireless/mediatek/mt76/mt7915/mt7915.h   |  2 ++
 2 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index 2bc9097c5214..d6efbf1a2724 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -97,7 +97,7 @@ static int
 mt7915_thermal_get_max_throttle_state(struct thermal_cooling_device *cdev,
 				      unsigned long *state)
 {
-	*state = MT7915_THERMAL_THROTTLE_MAX;
+	*state = MT7915_CDEV_THROTTLE_MAX;
 
 	return 0;
 }
@@ -108,7 +108,7 @@ mt7915_thermal_get_cur_throttle_state(struct thermal_cooling_device *cdev,
 {
 	struct mt7915_phy *phy = cdev->devdata;
 
-	*state = phy->throttle_state;
+	*state = phy->cdev_state;
 
 	return 0;
 }
@@ -120,20 +120,24 @@ mt7915_thermal_set_cur_throttle_state(struct thermal_cooling_device *cdev,
 	struct mt7915_phy *phy = cdev->devdata;
 	int ret;
 
-	if (state > MT7915_THERMAL_THROTTLE_MAX)
+	if (state > MT7915_CDEV_THROTTLE_MAX)
 		return -EINVAL;
 
 	if (phy->throttle_temp[0] > phy->throttle_temp[1])
 		return 0;
 
-	if (state == phy->throttle_state)
+	if (state == phy->cdev_state)
 		return 0;
 
-	ret = mt7915_mcu_set_thermal_throttling(phy, state);
+	// cooling_device convention: 0 = no cooling, more = more cooling
+	// mcu convention: 1 = max cooling, more = less cooling
+	ret = mt7915_mcu_set_thermal_throttling(phy,
+						MT7915_THERMAL_THROTTLE_MAX
+						- state);
 	if (ret)
 		return ret;
 
-	phy->throttle_state = state;
+	phy->cdev_state = state;
 
 	return 0;
 }
diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h b/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
index 0403912a521d..cf4c8d2dcc60 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/mt7915.h
@@ -49,6 +49,7 @@
 #define MT7915_CFEND_RATE_11B		0x03	/* 11B LP, 11M */
 
 #define MT7915_THERMAL_THROTTLE_MAX	100
+#define MT7915_CDEV_THROTTLE_MAX	99
 
 #define MT7915_SKU_RATE_NUM		161
 
@@ -218,6 +219,7 @@ struct mt7915_phy {
 	struct ieee80211_vif *monitor_vif;
 
 	struct thermal_cooling_device *cdev;
+	u8 cdev_state;
 	u8 throttle_state;
 	u32 throttle_temp[2]; /* 0: critical high, 1: maximum */
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 2/3] mt76: mt7915e: Add a hwmon attribute to get the actual throttle state.
  2022-02-07 17:37 ` [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat Nicolas Cavallari
@ 2022-02-07 17:37   ` Nicolas Cavallari
  2022-02-07 17:37   ` [PATCH RFC 3/3] mt76: mt7915e: Enable thermal management by default Nicolas Cavallari
  1 sibling, 0 replies; 4+ messages in thread
From: Nicolas Cavallari @ 2022-02-07 17:37 UTC (permalink / raw)
  To: Felix Fietkau, Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang
  Cc: Kalle Valo, linux-wireless

The firmware-controlled actual throttle state was previously available
by reading the cooling_device, but this confused the thermal subsystem.
Add a hwmon attribute to get it instead.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
---
 .../net/wireless/mediatek/mt76/mt7915/init.c  | 27 ++++++++++++-------
 1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index d6efbf1a2724..71ca4b0641b6 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -50,15 +50,22 @@ static ssize_t mt7915_thermal_temp_show(struct device *dev,
 	int i = to_sensor_dev_attr(attr)->index;
 	int temperature;
 
-	if (i)
-		return sprintf(buf, "%u\n", phy->throttle_temp[i - 1] * 1000);
-
-	temperature = mt7915_mcu_get_temperature(phy);
-	if (temperature < 0)
-		return temperature;
-
-	/* display in millidegree celcius */
-	return sprintf(buf, "%u\n", temperature * 1000);
+	switch (i) {
+	case 0:
+		temperature = mt7915_mcu_get_temperature(phy);
+		if (temperature < 0)
+			return temperature;
+		/* display in millidegree celcius */
+		return sprintf(buf, "%u\n", temperature * 1000);
+	case 1:
+	case 2:
+		return sprintf(buf, "%u\n",
+			       phy->throttle_temp[i - 1] * 1000);
+	case 3:
+		return sprintf(buf, "%hhu\n", phy->throttle_state);
+	default:
+		return -EINVAL;
+	}
 }
 
 static ssize_t mt7915_thermal_temp_store(struct device *dev,
@@ -84,11 +91,13 @@ static ssize_t mt7915_thermal_temp_store(struct device *dev,
 static SENSOR_DEVICE_ATTR_RO(temp1_input, mt7915_thermal_temp, 0);
 static SENSOR_DEVICE_ATTR_RW(temp1_crit, mt7915_thermal_temp, 1);
 static SENSOR_DEVICE_ATTR_RW(temp1_max, mt7915_thermal_temp, 2);
+static SENSOR_DEVICE_ATTR_RO(throttle1, mt7915_thermal_temp, 3);
 
 static struct attribute *mt7915_hwmon_attrs[] = {
 	&sensor_dev_attr_temp1_input.dev_attr.attr,
 	&sensor_dev_attr_temp1_crit.dev_attr.attr,
 	&sensor_dev_attr_temp1_max.dev_attr.attr,
+	&sensor_dev_attr_throttle1.dev_attr.attr,
 	NULL,
 };
 ATTRIBUTE_GROUPS(mt7915_hwmon);
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* [PATCH RFC 3/3] mt76: mt7915e: Enable thermal management by default
  2022-02-07 17:37 ` [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat Nicolas Cavallari
  2022-02-07 17:37   ` [PATCH RFC 2/3] mt76: mt7915e: Add a hwmon attribute to get the actual throttle state Nicolas Cavallari
@ 2022-02-07 17:37   ` Nicolas Cavallari
  1 sibling, 0 replies; 4+ messages in thread
From: Nicolas Cavallari @ 2022-02-07 17:37 UTC (permalink / raw)
  To: Felix Fietkau, Lorenzo Bianconi, Ryder Lee, Shayne Chen, Sean Wang
  Cc: Kalle Valo, linux-wireless

By default, mt7915e does not enable thermal management until the default
thermal zone does not reach a trip point.  If the rest of the system
have better cooling than the wireless hardware, then it is possible that
the wireless chip can overheat while the rest of the system is fine.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
---
 drivers/net/wireless/mediatek/mt76/mt7915/init.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/mediatek/mt76/mt7915/init.c b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
index 71ca4b0641b6..4fa5da9195a2 100644
--- a/drivers/net/wireless/mediatek/mt76/mt7915/init.c
+++ b/drivers/net/wireless/mediatek/mt76/mt7915/init.c
@@ -199,7 +199,8 @@ static int mt7915_thermal_init(struct mt7915_phy *phy)
 	phy->throttle_temp[0] = 110;
 	phy->throttle_temp[1] = 120;
 
-	return 0;
+	return mt7915_mcu_set_thermal_throttling(phy,
+						 MT7915_THERMAL_THROTTLE_MAX);
 }
 
 static void mt7915_led_set_config(struct led_classdev *led_cdev,
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2022-02-07 17:53 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-01-18  9:08 Issues with mt7915 thermal throttling Nicolas Cavallari
2022-02-07 17:37 ` [PATCH RFC 1/3] mt76: mt7915e: Fix degraded performance after temporary overheat Nicolas Cavallari
2022-02-07 17:37   ` [PATCH RFC 2/3] mt76: mt7915e: Add a hwmon attribute to get the actual throttle state Nicolas Cavallari
2022-02-07 17:37   ` [PATCH RFC 3/3] mt76: mt7915e: Enable thermal management by default Nicolas Cavallari

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).