* [PATCH net 0/2] mlxsw: core: Thermal control fixes
@ 2021-01-08 14:52 Ido Schimmel
  2021-01-08 14:52 ` [PATCH net 1/2] mlxsw: core: Add validation of transceiver temperature thresholds Ido Schimmel
                   ` (2 more replies)
  0 siblings, 3 replies; 4+ messages in thread
From: Ido Schimmel @ 2021-01-08 14:52 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, vadimp, jiri, mlxsw, Ido Schimmel

From: Ido Schimmel <idosch@nvidia.com>

This series includes two fixes for thermal control in mlxsw.

Patch #1 validates that the alarm temperature threshold read from a
transceiver is above the warning temperature threshold. If not, the
current thresholds are maintained. It was observed that some transceivers
might be unreliable and sometimes report an alarm temperature threshold
that is too low, which would result in thermal shutdown of the system.

Patch #2 increases the temperature threshold above which thermal
shutdown is triggered for the ASIC thermal zone. It is currently too low
and might result in thermal shutdown under perfectly fine operational
conditions.

Please consider both patches for stable.

Vadim Pasternak (2):
  mlxsw: core: Add validation of transceiver temperature thresholds
  mlxsw: core: Increase critical threshold for ASIC thermal zone

 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 13 ++++++++-----
 1 file changed, 8 insertions(+), 5 deletions(-)

-- 
2.29.2



* [PATCH net 1/2] mlxsw: core: Add validation of transceiver temperature thresholds
  2021-01-08 14:52 [PATCH net 0/2] mlxsw: core: Thermal control fixes Ido Schimmel
@ 2021-01-08 14:52 ` Ido Schimmel
  2021-01-08 14:52 ` [PATCH net 2/2] mlxsw: core: Increase critical threshold for ASIC thermal zone Ido Schimmel
  2021-01-10  0:30 ` [PATCH net 0/2] mlxsw: core: Thermal control fixes patchwork-bot+netdevbpf
  2 siblings, 0 replies; 4+ messages in thread
From: Ido Schimmel @ 2021-01-08 14:52 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, vadimp, jiri, mlxsw, Ido Schimmel

From: Vadim Pasternak <vadimp@nvidia.com>

Validate the thresholds so that a single bad readout from an unreliable
transceiver does not cause a failure. Ignore the latest readouts if the
warning temperature is above the alarm temperature, since applying them
can cause an unexpected thermal shutdown. Keep the previous values and
refresh the thresholds in the next iteration.

This is a rare scenario, but it was observed at a customer site.

Fixes: 6a79507cfe94 ("mlxsw: core: Extend thermal module with per QSFP module thermal zones")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 8fa286ccdd6b..250a85049697 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -176,6 +176,12 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
 	if (err)
 		return err;
 
+	if (crit_temp > emerg_temp) {
+		dev_warn(dev, "%s : Critical threshold %d is above emergency threshold %d\n",
+			 tz->tzdev->type, crit_temp, emerg_temp);
+		return 0;
+	}
+
 	/* According to the system thermal requirements, the thermal zones are
 	 * defined with four trip points. The critical and emergency
 	 * temperature thresholds, provided by QSFP module are set as "active"
@@ -190,11 +196,8 @@ mlxsw_thermal_module_trips_update(struct device *dev, struct mlxsw_core *core,
 		tz->trips[MLXSW_THERMAL_TEMP_TRIP_NORM].temp = crit_temp;
 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HIGH].temp = crit_temp;
 	tz->trips[MLXSW_THERMAL_TEMP_TRIP_HOT].temp = emerg_temp;
-	if (emerg_temp > crit_temp)
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
+	tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp +
 					MLXSW_THERMAL_MODULE_TEMP_SHIFT;
-	else
-		tz->trips[MLXSW_THERMAL_TEMP_TRIP_CRIT].temp = emerg_temp;
 
 	return 0;
 }
-- 
2.29.2
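
The check added by the patch above can be sketched in isolation. This is a minimal user-space illustration, not the driver code: the `example_*` names, the plain `int` trip array, and the boolean return value are assumptions made for the sketch, and the NORM trip is left out because the diff does not show how it is computed. As in the commit message, the warning threshold corresponds to `crit_temp` and the alarm threshold to `emerg_temp`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's defines; values are in
 * millidegrees Celsius, as in core_thermal.c.
 */
#define EXAMPLE_HYSTERESIS_TEMP		5000	/* 5C */
#define EXAMPLE_MODULE_TEMP_SHIFT	(EXAMPLE_HYSTERESIS_TEMP * 2)

enum example_trip {
	EXAMPLE_TRIP_NORM,
	EXAMPLE_TRIP_HIGH,
	EXAMPLE_TRIP_HOT,
	EXAMPLE_TRIP_CRIT,
	EXAMPLE_TRIP_COUNT,
};

/* Simplified thermal zone: one temperature per trip point. */
struct example_zone {
	int trips[EXAMPLE_TRIP_COUNT];
};

/* Apply new thresholds read from a transceiver.
 *
 * Returns true if the trip points were updated, and false if the readout
 * was rejected because the warning threshold is above the alarm threshold.
 * On rejection the previous trip points are left untouched, so a later
 * call with a sane readout can refresh them.
 */
static bool example_trips_update(struct example_zone *tz, int warn_temp,
				 int alarm_temp)
{
	assert(tz != NULL);

	if (warn_temp > alarm_temp)
		return false;

	tz->trips[EXAMPLE_TRIP_HIGH] = warn_temp;
	tz->trips[EXAMPLE_TRIP_HOT] = alarm_temp;
	/* After validation the shift is always applied. */
	tz->trips[EXAMPLE_TRIP_CRIT] = alarm_temp + EXAMPLE_MODULE_TEMP_SHIFT;

	return true;
}
```

With this shape, a readout such as warning 75C / alarm 40C is dropped and the zone keeps its earlier trip points instead of installing a critical trip that normal operating temperatures would cross.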



* [PATCH net 2/2] mlxsw: core: Increase critical threshold for ASIC thermal zone
  2021-01-08 14:52 [PATCH net 0/2] mlxsw: core: Thermal control fixes Ido Schimmel
  2021-01-08 14:52 ` [PATCH net 1/2] mlxsw: core: Add validation of transceiver temperature thresholds Ido Schimmel
@ 2021-01-08 14:52 ` Ido Schimmel
  2021-01-10  0:30 ` [PATCH net 0/2] mlxsw: core: Thermal control fixes patchwork-bot+netdevbpf
  2 siblings, 0 replies; 4+ messages in thread
From: Ido Schimmel @ 2021-01-08 14:52 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuba, vadimp, jiri, mlxsw, Ido Schimmel

From: Vadim Pasternak <vadimp@nvidia.com>

Increase the critical threshold for the ASIC thermal zone from 110C to 140C
according to the system hardware requirements. All the supported ASICs
(Spectrum-1, Spectrum-2, Spectrum-3) can still be operational with an ASIC
temperature below 140C. With the old critical threshold value, the system
can perform an unjustified shutdown.

All the systems equipped with the above ASICs implement a thermal
protection mechanism at the firmware level, and the firmware can decide to
perform a system thermal shutdown even when the temperature is below 140C.
So with the new threshold the system will not melt down, while the thermal
operating range is aligned with the hardware capabilities.

Fixes: 41e760841d26 ("mlxsw: core: Replace thermal temperature trips with defines")
Fixes: a50c1e35650b ("mlxsw: core: Implement thermal zone")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core_thermal.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
index 250a85049697..bf85ce9835d7 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core_thermal.c
@@ -19,7 +19,7 @@
 #define MLXSW_THERMAL_ASIC_TEMP_NORM	75000	/* 75C */
 #define MLXSW_THERMAL_ASIC_TEMP_HIGH	85000	/* 85C */
 #define MLXSW_THERMAL_ASIC_TEMP_HOT	105000	/* 105C */
-#define MLXSW_THERMAL_ASIC_TEMP_CRIT	110000	/* 110C */
+#define MLXSW_THERMAL_ASIC_TEMP_CRIT	140000	/* 140C */
 #define MLXSW_THERMAL_HYSTERESIS_TEMP	5000	/* 5C */
 #define MLXSW_THERMAL_MODULE_TEMP_SHIFT	(MLXSW_THERMAL_HYSTERESIS_TEMP * 2)
 #define MLXSW_THERMAL_ZONE_MAX_NAME	16
-- 
2.29.2
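
The effect of the changed define can be illustrated with a small stand-alone comparison. This is a sketch under assumptions: the `EXAMPLE_*` names and the helper are hypothetical, and the "temperature at or above the critical trip leads to shutdown" rule is a simplification of how a critical trip point is handled, not code taken from the driver.

```c
#include <assert.h>
#include <stdbool.h>

/* Values in millidegrees Celsius, mirroring the defines in the diff. */
#define EXAMPLE_ASIC_TEMP_HOT		105000	/* 105C */
#define EXAMPLE_ASIC_TEMP_CRIT_OLD	110000	/* 110C, before the patch */
#define EXAMPLE_ASIC_TEMP_CRIT_NEW	140000	/* 140C, after the patch */

/* The critical trip must sit above the hot trip in both variants. */
static_assert(EXAMPLE_ASIC_TEMP_HOT < EXAMPLE_ASIC_TEMP_CRIT_OLD,
	      "old critical trip must be above the hot trip");
static_assert(EXAMPLE_ASIC_TEMP_HOT < EXAMPLE_ASIC_TEMP_CRIT_NEW,
	      "new critical trip must be above the hot trip");

/* Simplified model: reaching the critical trip triggers a shutdown. */
static bool example_would_shut_down(int temp_mc, int crit_mc)
{
	return temp_mc >= crit_mc;
}
```

In this model an ASIC reading of 120C would shut the system down under the old 110C threshold but not under the new 140C one, which is the unjustified shutdown the commit message describes.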



* Re: [PATCH net 0/2] mlxsw: core: Thermal control fixes
  2021-01-08 14:52 [PATCH net 0/2] mlxsw: core: Thermal control fixes Ido Schimmel
  2021-01-08 14:52 ` [PATCH net 1/2] mlxsw: core: Add validation of transceiver temperature thresholds Ido Schimmel
  2021-01-08 14:52 ` [PATCH net 2/2] mlxsw: core: Increase critical threshold for ASIC thermal zone Ido Schimmel
@ 2021-01-10  0:30 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 4+ messages in thread
From: patchwork-bot+netdevbpf @ 2021-01-10  0:30 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, kuba, vadimp, jiri, mlxsw, idosch

Hello:

This series was applied to netdev/net.git (refs/heads/master):

On Fri,  8 Jan 2021 16:52:08 +0200 you wrote:
> From: Ido Schimmel <idosch@nvidia.com>
> 
> This series includes two fixes for thermal control in mlxsw.
> 
> Patch #1 validates that the alarm temperature threshold read from a
> transceiver is above the warning temperature threshold. If not, the
> current thresholds are maintained. It was observed that some transceivers
> might be unreliable and sometimes report an alarm temperature threshold
> that is too low, which would result in thermal shutdown of the system.
> 
> [...]

Here is the summary with links:
  - [net,1/2] mlxsw: core: Add validation of transceiver temperature thresholds
    https://git.kernel.org/netdev/net/c/57726ebe2733
  - [net,2/2] mlxsw: core: Increase critical threshold for ASIC thermal zone
    https://git.kernel.org/netdev/net/c/b06ca3d5a43c

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


