linux-pm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
@ 2021-09-27  8:24 Vadim Pasternak
  2021-09-27 10:41 ` Daniel Lezcano
  0 siblings, 1 reply; 9+ messages in thread
From: Vadim Pasternak @ 2021-09-27  8:24 UTC (permalink / raw)
  To: rui.zhang, daniel.lezcano; +Cc: =idosch, linux-pm, Vadim Pasternak

The thermal subsystem maintains a transition table between states that
is allocated according to the maximum state supported by the cooling
device.

When the table needs to be updated, the thermal subsystem does not
validate that the new state does not exceed the maximum state,
leading to out-of-bounds memory accesses [1].

Fix this by validating the new state before updating the table.

Some thermal drivers allow user space to configure such states to limit
the minimum fan speed. See, for example:
commit a421ce088ac8 ("mlxsw: core: Extend cooling device with cooling levels")

In any case, the thermal subsystem needs to validate the state before
using it as an index into the transition table.

[1]
[  156.703153] ==================================================================
[  156.710613] BUG: KASAN: slab-out-of-bounds in thermal_cooling_device_stats_update+0x7d/0xb0
[  156.719227] Read of size 4 at addr ffff88811f63e840 by task hw-management.s/2300
[  156.726816]
[  156.728369] CPU: 1 PID: 2300 Comm: hw-management.s Not tainted 5.15.0-rc2-dvs-00320-ga3b397b4fffb #1
[  156.737766] Hardware name: Mellanox Technologies Ltd. MSN2410/VMOD0001, BIOS 4.6.5 09/13/2018
[  156.746561] Call Trace:
[  156.749093]  dump_stack_lvl+0x44/0x57
[  156.752890]  print_address_description.constprop.9+0x21/0x150
[  156.758824]  ? thermal_cooling_device_stats_update+0x7d/0xb0
[  156.764675]  ? thermal_cooling_device_stats_update+0x7d/0xb0
[  156.770521]  kasan_report.cold.14+0x83/0xdf
[  156.774835]  ? thermal_cooling_device_stats_update+0x7d/0xb0
[  156.780654]  thermal_cooling_device_stats_update+0x7d/0xb0
[  156.786314]  __thermal_cdev_update+0xc0/0x200
[  156.790818]  thermal_cdev_update+0x4e/0x70
[  156.795064]  step_wise_throttle+0x426/0x6f0
[  156.799408]  ? devm_thermal_add_hwmon_sysfs+0x80/0x80
[  156.804628]  ? thermal_zone_get_temp+0xa0/0xa0
[  156.809185]  ? netlink_broadcast+0xa/0x10
[  156.813353]  ? thermal_genl_sampling_temp+0x16a/0x200
[  156.818563]  ? thermal_genl_cmd_doit+0x290/0x290
[  156.823332]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  156.827900]  thermal_zone_device_update+0x39e/0x850
[  156.832954]  ? trace_event_raw_event_thermal_zone_trip+0x1f0/0x1f0
[  156.839349]  ? wait_for_completion+0x190/0x190
[  156.843944]  ? devm_thermal_of_cooling_device_register+0xa0/0xa0
[  156.850148]  ? snprintf+0x91/0xc0
[  156.853594]  ? vsprintf+0x10/0x10
[  156.857039]  thermal_zone_device_set_mode+0x81/0xf0
[  156.862079]  mlxsw_thermal_modules_init+0x48f/0x590 [mlxsw_core]
[  156.868392]  ? mlxsw_thermal_set_cur_state+0x590/0x590 [mlxsw_core]
[  156.874954]  ? netlink_broadcast+0xa/0x10
[  156.879090]  ? thermal_genl_send_event+0x117/0x1a0
[  156.884039]  ? thermal_notify_tz_create+0x7d/0xb0
[  156.888903]  ? thermal_genl_sampling_temp+0x200/0x200
[  156.894147]  ? do_init_timer+0x6c/0x80
[  156.898009]  ? thermal_zone_device_register+0x8af/0x9a0
[  156.903438]  ? __thermal_cooling_device_register+0x4f2/0x500
[  156.909272]  mlxsw_thermal_init+0x763/0x880 [mlxsw_core]
[  156.914882]  ? mlxsw_thermal_gearboxes_init.isra.8+0x460/0x460 [mlxsw_core]
[  156.922214]  __mlxsw_core_bus_device_register+0xa0c/0xca0 [mlxsw_core]
[  156.929093]  ? dev_printk_emit+0x90/0xb6
[  156.933136]  ? dev_vprintk_emit+0x208/0x208
[  156.937462]  ? mlxsw_devlink_info_get+0x490/0x490 [mlxsw_core]
[  156.943580]  ? do_raw_spin_lock+0x1d0/0x1d0
[  156.947906]  ? lockdep_hardirqs_on_prepare+0xe/0x230
[  156.953041]  ? __dev_printk+0x9e/0xd6
[  156.956818]  ? _dev_info+0xc8/0xf6
[  156.960339]  ? _dev_notice+0x84/0xf6
[  156.964047]  ? mark_wakeup_next_waiter+0x1b0/0x1b0
[  156.968988]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
[  156.975492]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c]
[  156.981259]  ? mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c]
[  156.987701]  ? lock_release+0x50/0x6c0
[  156.991555]  ? devres_open_group+0x13d/0x180
[  156.995936]  ? lock_downgrade+0x3a0/0x3a0
[  157.000108]  ? lock_contended+0x710/0x710
[  157.004229]  ? devres_open_group+0x59/0x180
[  157.008539]  ? devres_log+0x11a/0x180
[  157.012309]  ? trace_hardirqs_on+0x1c/0x110
[  157.016630]  ? preempt_count_sub+0xf/0xb0
[  157.020781]  ? mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c]
[  157.027255]  ? i2c_device_probe+0x30e/0x370
[  157.031573]  i2c_device_probe+0x30e/0x370
[  157.035746]  really_probe+0x149/0x3c0
[  157.039535]  ? driver_allows_async_probing+0x80/0x80
[  157.044629]  __driver_probe_device+0xc3/0x130
[  157.049127]  driver_probe_device+0x45/0x100
[  157.053486]  __device_attach_driver+0xd6/0x100
[  157.058069]  bus_for_each_drv+0xe7/0x150
[  157.062122]  ? bus_rescan_devices+0x10/0x10
[  157.066444]  ? lockdep_hardirqs_on_prepare+0xe/0x230
[  157.071587]  ? trace_hardirqs_on+0x1c/0x110
[  157.075891]  ? preempt_count_sub+0xf/0xb0
[  157.080050]  ? _raw_spin_unlock_irqrestore+0x36/0x50
[  157.085193]  __device_attach+0x185/0x210
[  157.089231]  ? device_bind_driver+0x70/0x70
[  157.093569]  ? kobject_uevent_env+0x287/0x940
[  157.098091]  bus_probe_device+0xf9/0x120
[  157.102159]  device_add+0x623/0xeb0
[  157.105746]  ? static_obj+0x32/0x80
[  157.109348]  ? lockdep_init_map_type+0xd9/0x360
[  157.114062]  ? __fw_devlink_link_to_suppliers+0x270/0x270
[  157.119619]  ? __raw_spin_lock_init+0x71/0x80
[  157.124134]  i2c_new_client_device+0x277/0x3a0
[  157.128715]  new_device_store+0x13c/0x270
[  157.132847]  ? copyin+0x6b/0x80
[  157.136090]  ? i2c_new_ancillary_device+0x20/0x20
[  157.140946]  ? lock_acquire+0xc0/0x3f0
[  157.144830]  ? lock_release+0x6c0/0x6c0
[  157.148833]  ? sysfs_file_ops+0x6b/0x90
[  157.152808]  ? sysfs_file_ops+0x90/0x90
[  157.156774]  kernfs_fop_write_iter+0x1af/0x250
[  157.161346]  new_sync_write+0x25a/0x380
[  157.165327]  ? new_sync_read+0x370/0x370
[  157.169384]  ? rcu_read_lock_bh_held+0xb0/0xb0
[  157.173973]  ? rcu_read_lock_held_common+0x12/0x50
[  157.178911]  ? irq_migrate_all_off_this_cpu+0xf0/0x300
[  157.184208]  ? lock_release+0x6c0/0x6c0
[  157.188149]  ? rcu_read_lock_sched_held+0x5a/0xd0
[  157.192995]  ? rcu_read_lock_held+0xb0/0xb0
[  157.197310]  vfs_write+0x33d/0x530
[  157.200839]  ksys_write+0xbb/0x150
[  157.204359]  ? __ia32_sys_read+0x40/0x40
[  157.208410]  ? set_load_weight+0xd1/0x110
[  157.212557]  do_syscall_64+0x3a/0x80
[  157.216256]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  157.221457] RIP: 0033:0x7fa80bd47970
[  157.225131] Code: 73 01 c3 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 99 2d 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e 9b 01 00 48 89 04 24
[  157.244407] RSP: 002b:00007fff14b2c648 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  157.252196] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007fa80bd47970
[  157.259528] RDX: 0000000000000013 RSI: 0000000000ede408 RDI: 0000000000000001
[  157.266868] RBP: 0000000000ede408 R08: 00007fa80c007760 R09: 00007fa80c653700
[  157.274219] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000013
[  157.281564] R13: 0000000000000001 R14: 00007fa80c006600 R15: 0000000000000013
[  157.288933]
[  157.290484] Allocated by task 2300:
[  157.294054]  kasan_save_stack+0x19/0x40
[  157.298024]  __kasan_kmalloc+0x7f/0xa0
[  157.301914]  __kmalloc+0x18f/0x2c0
[  157.305417]  thermal_cooling_device_setup_sysfs+0xf9/0x1a0
[  157.311075]  __thermal_cooling_device_register+0x1b5/0x500
[  157.316734]  mlxsw_thermal_init+0x7e4/0x880 [mlxsw_core]
[  157.322325]  __mlxsw_core_bus_device_register+0xa0c/0xca0 [mlxsw_core]
[  157.329154]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
[  157.335656]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c]
[  157.341373]  i2c_device_probe+0x30e/0x370
[  157.345524]  really_probe+0x149/0x3c0
[  157.349330]  __driver_probe_device+0xc3/0x130
[  157.353815]  driver_probe_device+0x45/0x100
[  157.358117]  __device_attach_driver+0xd6/0x100
[  157.362712]  bus_for_each_drv+0xe7/0x150
[  157.366768]  __device_attach+0x185/0x210
[  157.370823]  bus_probe_device+0xf9/0x120
[  157.374897]  device_add+0x623/0xeb0
[  157.378512]  i2c_new_client_device+0x277/0x3a0
[  157.383094]  new_device_store+0x13c/0x270
[  157.387222]  kernfs_fop_write_iter+0x1af/0x250
[  157.391797]  new_sync_write+0x25a/0x380
[  157.395765]  vfs_write+0x33d/0x530
[  157.399267]  ksys_write+0xbb/0x150
[  157.402795]  do_syscall_64+0x3a/0x80
[  157.406496]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  157.411686]
[  157.413236] The buggy address belongs to the object at ffff88811f63e400
[  157.413236]  which belongs to the cache kmalloc-1k of size 1024
[  157.426135] The buggy address is located 64 bytes to the right of
[  157.426135]  1024-byte region [ffff88811f63e400, ffff88811f63e800)
[  157.438710] The buggy address belongs to the page:
[  157.443640] page:ffffea00047d8e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88811f63c000 pfn:0x11f638
[  157.454579] head:ffffea00047d8e00 order:3 compound_mapcount:0 compound_pincount:0
[  157.462281] flags: 0x200000000010200(slab|head|node=0|zone=2)
[  157.468220] raw: 0200000000010200 ffffea000863a808 ffffea00045ab208 ffff888100043bc0
[  157.476205] raw: ffff88811f63c000 00000000000a0002 00000001ffffffff 0000000000000000
[  157.484185] page dumped because: kasan: bad access detected
[  157.489895]
[  157.491445] Memory state around the buggy address:
[  157.496373]  ffff88811f63e700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  157.503809]  ffff88811f63e780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  157.511235] >ffff88811f63e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  157.518641]                                            ^
[  157.524113]  ffff88811f63e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  157.531571]  ffff88811f63e900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[  157.538980] ==================================================================

Fixes: 8ea229511e06 ("thermal: Add cooling device's statistics in sysfs")
Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
---
 drivers/thermal/thermal_sysfs.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
index 1c4aac8464a7..80b38b180140 100644
--- a/drivers/thermal/thermal_sysfs.c
+++ b/drivers/thermal/thermal_sysfs.c
@@ -673,10 +673,14 @@ void thermal_cooling_device_stats_update(struct thermal_cooling_device *cdev,
 					 unsigned long new_state)
 {
 	struct cooling_dev_stats *stats = cdev->stats;
+	unsigned long max_state;
 
 	if (!stats)
 		return;
 
+	if (cdev->ops->get_max_state(cdev, &max_state) || new_state > max_state)
+		return;
+
 	spin_lock(&stats->lock);
 
 	if (stats->state == new_state)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27  8:24 [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation Vadim Pasternak
@ 2021-09-27 10:41 ` Daniel Lezcano
  2021-09-27 11:22   ` Vadim Pasternak
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lezcano @ 2021-09-27 10:41 UTC (permalink / raw)
  To: Vadim Pasternak, rui.zhang; +Cc: =idosch, linux-pm


Hi Vadim,


On 27/09/2021 10:24, Vadim Pasternak wrote:
> The thermal subsystem maintains a transition table between states that
> is allocated according to the maximum state supported by the cooling
> device.
> 
> When the table needs to be updated, the thermal subsystem does not
> validate that the new state does not exceed the maximum state,
> leading to out-of-bounds memory accesses [1].

Actually, thermal_cooling_device_stats_update() is called if the
set_cur_state is successful.

With a state greater than the max state, the set_cur_state should fail
and thermal_cooling_device_stats_update() is not called.

Perhaps the problem is in mlxsw_thermal_set_cur_state() ?

> Fix this by validating the new state before updating the table.
> 
> Some thermal drivers allow user space to configure such states to limit
> the minimum fan speed. See, for example:
> commit a421ce088ac8 ("mlxsw: core: Extend cooling device with cooling levels")
> 
> In any case, the thermal subsystem needs to validate the state before
> using it as an index into the transition table.
> 
> [1]
> [  156.703153] ==================================================================
> [  156.710613] BUG: KASAN: slab-out-of-bounds in thermal_cooling_device_stats_update+0x7d/0xb0
> [  156.719227] Read of size 4 at addr ffff88811f63e840 by task hw-management.s/2300
> [  156.726816]
> [  156.728369] CPU: 1 PID: 2300 Comm: hw-management.s Not tainted 5.15.0-rc2-dvs-00320-ga3b397b4fffb #1
> [  156.737766] Hardware name: Mellanox Technologies Ltd. MSN2410/VMOD0001, BIOS 4.6.5 09/13/2018
> [  156.746561] Call Trace:
> [  156.749093]  dump_stack_lvl+0x44/0x57
> [  156.752890]  print_address_description.constprop.9+0x21/0x150
> [  156.758824]  ? thermal_cooling_device_stats_update+0x7d/0xb0
> [  156.764675]  ? thermal_cooling_device_stats_update+0x7d/0xb0
> [  156.770521]  kasan_report.cold.14+0x83/0xdf
> [  156.774835]  ? thermal_cooling_device_stats_update+0x7d/0xb0
> [  156.780654]  thermal_cooling_device_stats_update+0x7d/0xb0
> [  156.786314]  __thermal_cdev_update+0xc0/0x200
> [  156.790818]  thermal_cdev_update+0x4e/0x70
> [  156.795064]  step_wise_throttle+0x426/0x6f0
> [  156.799408]  ? devm_thermal_add_hwmon_sysfs+0x80/0x80
> [  156.804628]  ? thermal_zone_get_temp+0xa0/0xa0
> [  156.809185]  ? netlink_broadcast+0xa/0x10
> [  156.813353]  ? thermal_genl_sampling_temp+0x16a/0x200
> [  156.818563]  ? thermal_genl_cmd_doit+0x290/0x290
> [  156.823332]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [  156.827900]  thermal_zone_device_update+0x39e/0x850
> [  156.832954]  ? trace_event_raw_event_thermal_zone_trip+0x1f0/0x1f0
> [  156.839349]  ? wait_for_completion+0x190/0x190
> [  156.843944]  ? devm_thermal_of_cooling_device_register+0xa0/0xa0
> [  156.850148]  ? snprintf+0x91/0xc0
> [  156.853594]  ? vsprintf+0x10/0x10
> [  156.857039]  thermal_zone_device_set_mode+0x81/0xf0
> [  156.862079]  mlxsw_thermal_modules_init+0x48f/0x590 [mlxsw_core]
> [  156.868392]  ? mlxsw_thermal_set_cur_state+0x590/0x590 [mlxsw_core]
> [  156.874954]  ? netlink_broadcast+0xa/0x10
> [  156.879090]  ? thermal_genl_send_event+0x117/0x1a0
> [  156.884039]  ? thermal_notify_tz_create+0x7d/0xb0
> [  156.888903]  ? thermal_genl_sampling_temp+0x200/0x200
> [  156.894147]  ? do_init_timer+0x6c/0x80
> [  156.898009]  ? thermal_zone_device_register+0x8af/0x9a0
> [  156.903438]  ? __thermal_cooling_device_register+0x4f2/0x500
> [  156.909272]  mlxsw_thermal_init+0x763/0x880 [mlxsw_core]
> [  156.914882]  ? mlxsw_thermal_gearboxes_init.isra.8+0x460/0x460 [mlxsw_core]
> [  156.922214]  __mlxsw_core_bus_device_register+0xa0c/0xca0 [mlxsw_core]
> [  156.929093]  ? dev_printk_emit+0x90/0xb6
> [  156.933136]  ? dev_vprintk_emit+0x208/0x208
> [  156.937462]  ? mlxsw_devlink_info_get+0x490/0x490 [mlxsw_core]
> [  156.943580]  ? do_raw_spin_lock+0x1d0/0x1d0
> [  156.947906]  ? lockdep_hardirqs_on_prepare+0xe/0x230
> [  156.953041]  ? __dev_printk+0x9e/0xd6
> [  156.956818]  ? _dev_info+0xc8/0xf6
> [  156.960339]  ? _dev_notice+0x84/0xf6
> [  156.964047]  ? mark_wakeup_next_waiter+0x1b0/0x1b0
> [  156.968988]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
> [  156.975492]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c]
> [  156.981259]  ? mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c]
> [  156.987701]  ? lock_release+0x50/0x6c0
> [  156.991555]  ? devres_open_group+0x13d/0x180
> [  156.995936]  ? lock_downgrade+0x3a0/0x3a0
> [  157.000108]  ? lock_contended+0x710/0x710
> [  157.004229]  ? devres_open_group+0x59/0x180
> [  157.008539]  ? devres_log+0x11a/0x180
> [  157.012309]  ? trace_hardirqs_on+0x1c/0x110
> [  157.016630]  ? preempt_count_sub+0xf/0xb0
> [  157.020781]  ? mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c]
> [  157.027255]  ? i2c_device_probe+0x30e/0x370
> [  157.031573]  i2c_device_probe+0x30e/0x370
> [  157.035746]  really_probe+0x149/0x3c0
> [  157.039535]  ? driver_allows_async_probing+0x80/0x80
> [  157.044629]  __driver_probe_device+0xc3/0x130
> [  157.049127]  driver_probe_device+0x45/0x100
> [  157.053486]  __device_attach_driver+0xd6/0x100
> [  157.058069]  bus_for_each_drv+0xe7/0x150
> [  157.062122]  ? bus_rescan_devices+0x10/0x10
> [  157.066444]  ? lockdep_hardirqs_on_prepare+0xe/0x230
> [  157.071587]  ? trace_hardirqs_on+0x1c/0x110
> [  157.075891]  ? preempt_count_sub+0xf/0xb0
> [  157.080050]  ? _raw_spin_unlock_irqrestore+0x36/0x50
> [  157.085193]  __device_attach+0x185/0x210
> [  157.089231]  ? device_bind_driver+0x70/0x70
> [  157.093569]  ? kobject_uevent_env+0x287/0x940
> [  157.098091]  bus_probe_device+0xf9/0x120
> [  157.102159]  device_add+0x623/0xeb0
> [  157.105746]  ? static_obj+0x32/0x80
> [  157.109348]  ? lockdep_init_map_type+0xd9/0x360
> [  157.114062]  ? __fw_devlink_link_to_suppliers+0x270/0x270
> [  157.119619]  ? __raw_spin_lock_init+0x71/0x80
> [  157.124134]  i2c_new_client_device+0x277/0x3a0
> [  157.128715]  new_device_store+0x13c/0x270
> [  157.132847]  ? copyin+0x6b/0x80
> [  157.136090]  ? i2c_new_ancillary_device+0x20/0x20
> [  157.140946]  ? lock_acquire+0xc0/0x3f0
> [  157.144830]  ? lock_release+0x6c0/0x6c0
> [  157.148833]  ? sysfs_file_ops+0x6b/0x90
> [  157.152808]  ? sysfs_file_ops+0x90/0x90
> [  157.156774]  kernfs_fop_write_iter+0x1af/0x250
> [  157.161346]  new_sync_write+0x25a/0x380
> [  157.165327]  ? new_sync_read+0x370/0x370
> [  157.169384]  ? rcu_read_lock_bh_held+0xb0/0xb0
> [  157.173973]  ? rcu_read_lock_held_common+0x12/0x50
> [  157.178911]  ? irq_migrate_all_off_this_cpu+0xf0/0x300
> [  157.184208]  ? lock_release+0x6c0/0x6c0
> [  157.188149]  ? rcu_read_lock_sched_held+0x5a/0xd0
> [  157.192995]  ? rcu_read_lock_held+0xb0/0xb0
> [  157.197310]  vfs_write+0x33d/0x530
> [  157.200839]  ksys_write+0xbb/0x150
> [  157.204359]  ? __ia32_sys_read+0x40/0x40
> [  157.208410]  ? set_load_weight+0xd1/0x110
> [  157.212557]  do_syscall_64+0x3a/0x80
> [  157.216256]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  157.221457] RIP: 0033:0x7fa80bd47970
> [  157.225131] Code: 73 01 c3 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00 83 3d 99 2d 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 7e 9b 01 00 48 89 04 24
> [  157.244407] RSP: 002b:00007fff14b2c648 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [  157.252196] RAX: ffffffffffffffda RBX: 0000000000000013 RCX: 00007fa80bd47970
> [  157.259528] RDX: 0000000000000013 RSI: 0000000000ede408 RDI: 0000000000000001
> [  157.266868] RBP: 0000000000ede408 R08: 00007fa80c007760 R09: 00007fa80c653700
> [  157.274219] R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000013
> [  157.281564] R13: 0000000000000001 R14: 00007fa80c006600 R15: 0000000000000013
> [  157.288933]
> [  157.290484] Allocated by task 2300:
> [  157.294054]  kasan_save_stack+0x19/0x40
> [  157.298024]  __kasan_kmalloc+0x7f/0xa0
> [  157.301914]  __kmalloc+0x18f/0x2c0
> [  157.305417]  thermal_cooling_device_setup_sysfs+0xf9/0x1a0
> [  157.311075]  __thermal_cooling_device_register+0x1b5/0x500
> [  157.316734]  mlxsw_thermal_init+0x7e4/0x880 [mlxsw_core]
> [  157.322325]  __mlxsw_core_bus_device_register+0xa0c/0xca0 [mlxsw_core]
> [  157.329154]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
> [  157.335656]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c]
> [  157.341373]  i2c_device_probe+0x30e/0x370
> [  157.345524]  really_probe+0x149/0x3c0
> [  157.349330]  __driver_probe_device+0xc3/0x130
> [  157.353815]  driver_probe_device+0x45/0x100
> [  157.358117]  __device_attach_driver+0xd6/0x100
> [  157.362712]  bus_for_each_drv+0xe7/0x150
> [  157.366768]  __device_attach+0x185/0x210
> [  157.370823]  bus_probe_device+0xf9/0x120
> [  157.374897]  device_add+0x623/0xeb0
> [  157.378512]  i2c_new_client_device+0x277/0x3a0
> [  157.383094]  new_device_store+0x13c/0x270
> [  157.387222]  kernfs_fop_write_iter+0x1af/0x250
> [  157.391797]  new_sync_write+0x25a/0x380
> [  157.395765]  vfs_write+0x33d/0x530
> [  157.399267]  ksys_write+0xbb/0x150
> [  157.402795]  do_syscall_64+0x3a/0x80
> [  157.406496]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  157.411686]
> [  157.413236] The buggy address belongs to the object at ffff88811f63e400
> [  157.413236]  which belongs to the cache kmalloc-1k of size 1024
> [  157.426135] The buggy address is located 64 bytes to the right of
> [  157.426135]  1024-byte region [ffff88811f63e400, ffff88811f63e800)
> [  157.438710] The buggy address belongs to the page:
> [  157.443640] page:ffffea00047d8e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88811f63c000 pfn:0x11f638
> [  157.454579] head:ffffea00047d8e00 order:3 compound_mapcount:0 compound_pincount:0
> [  157.462281] flags: 0x200000000010200(slab|head|node=0|zone=2)
> [  157.468220] raw: 0200000000010200 ffffea000863a808 ffffea00045ab208 ffff888100043bc0
> [  157.476205] raw: ffff88811f63c000 00000000000a0002 00000001ffffffff 0000000000000000
> [  157.484185] page dumped because: kasan: bad access detected
> [  157.489895]
> [  157.491445] Memory state around the buggy address:
> [  157.496373]  ffff88811f63e700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  157.503809]  ffff88811f63e780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  157.511235] >ffff88811f63e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  157.518641]                                            ^
> [  157.524113]  ffff88811f63e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  157.531571]  ffff88811f63e900: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> [  157.538980] ==================================================================
> 
> Fixes: 8ea229511e06 ("thermal: Add cooling device's statistics in sysfs")
> Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
> ---
>  drivers/thermal/thermal_sysfs.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/drivers/thermal/thermal_sysfs.c b/drivers/thermal/thermal_sysfs.c
> index 1c4aac8464a7..80b38b180140 100644
> --- a/drivers/thermal/thermal_sysfs.c
> +++ b/drivers/thermal/thermal_sysfs.c
> @@ -673,10 +673,14 @@ void thermal_cooling_device_stats_update(struct thermal_cooling_device *cdev,
>  					 unsigned long new_state)
>  {
>  	struct cooling_dev_stats *stats = cdev->stats;
> +	unsigned long max_state;
>  
>  	if (!stats)
>  		return;
>  
> +	if (cdev->ops->get_max_state(cdev, &max_state) || new_state > max_state)
> +		return;
> +
>  	spin_lock(&stats->lock);
>  
>  	if (stats->state == new_state)
> 


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 10:41 ` Daniel Lezcano
@ 2021-09-27 11:22   ` Vadim Pasternak
  2021-09-27 12:31     ` Daniel Lezcano
  0 siblings, 1 reply; 9+ messages in thread
From: Vadim Pasternak @ 2021-09-27 11:22 UTC (permalink / raw)
  To: Daniel Lezcano, rui.zhang; +Cc: linux-pm, Ido Schimmel

Hi Daniel,

Thank you for quick reply.

> -----Original Message-----
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Sent: Monday, September 27, 2021 1:42 PM
> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
> update for configuration operation
> 
> 
> Hi Vadim,
> 
> 
> On 27/09/2021 10:24, Vadim Pasternak wrote:
> > The thermal subsystem maintains a transition table between states that
> > is allocated according to the maximum state supported by the cooling
> > device.
> >
> > When the table needs to be updated, the thermal subsystem does not
> > validate that the new state does not exceed the maximum state, leading
> > to out-of-bounds memory accesses [1].
> 
> Actually, thermal_cooling_device_stats_update() is called if the
> set_cur_state is successful.
> 
> With a state greater than the max state, the set_cur_state should fail and
> thermal_cooling_device_stats_update() is not called.
> 
> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?

"mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state' for
configuration purpose to limit minimum fan speed. 
Fan speed minimum is enforced by setting 'cur_state' with value
exceeding actual fan speed maximum.

This feature provides ability to limit fan speed according to some
system wise considerations, like absence of some replaceable units 
or high system ambient temperature, or some other factors which
indirectly impacts system airflow.

For example, if cooling devices operates at cooling levels from 1 to 10
(1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
speed can be limited by setting 'cur_state' attribute through 'sysfs'
to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
Following this example if value is set to 14 (40%) cooling levels vector
will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting device speed
cooling states respectively in 40, 40, 40, 40, 40, 50, 60. 70, 80, 90,
100 percent. And it limits cooling device to operate only at 40% speed
and above.

Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
attribute for this feature, but it was not done.

We have another driver required this feature and one new we are
developing now, which require fan minim speed limit as well.

Thanks,
Vadim.

> >
> > Request for changing fan minimum speed is configuration request and 
> > can be set only through 'sysfs' write procedure. Such state change:
> > - should not be reported by thermal_notify_cdev_state_update().
> > - should not be recorded by thermal_cooling_device_stats_update().
> >   In this case statistics update violates range of thermal statistics
> >   table.

> 
> > Fix this by validating the new state before updating the table.
> >
> > Some thermal drivers allow user space to configure such states to
> > limit the minimum fan speed. See, for example:
> > commit a421ce088ac8 ("mlxsw: core: Extend cooling device with cooling
> > levels")
> >
> > In any case, the thermal subsystem needs to validate the state before
> > using it as an index into the transition table.
> >
> > [1]
> > [  156.703153]
> >
> ==========================================================
> ========
> > [  156.710613] BUG: KASAN: slab-out-of-bounds in
> > thermal_cooling_device_stats_update+0x7d/0xb0
> > [  156.719227] Read of size 4 at addr ffff88811f63e840 by task
> > hw-management.s/2300 [  156.726816] [  156.728369] CPU: 1 PID: 2300
> > Comm: hw-management.s Not tainted 5.15.0-rc2-dvs-00320-ga3b397b4fffb
> > #1 [  156.737766] Hardware name: Mellanox Technologies Ltd.
> > MSN2410/VMOD0001, BIOS 4.6.5 09/13/2018 [  156.746561] Call Trace:
> > [  156.749093]  dump_stack_lvl+0x44/0x57 [  156.752890]
> > print_address_description.constprop.9+0x21/0x150
> > [  156.758824]  ? thermal_cooling_device_stats_update+0x7d/0xb0
> > [  156.764675]  ? thermal_cooling_device_stats_update+0x7d/0xb0
> > [  156.770521]  kasan_report.cold.14+0x83/0xdf [  156.774835]  ?
> > thermal_cooling_device_stats_update+0x7d/0xb0
> > [  156.780654]  thermal_cooling_device_stats_update+0x7d/0xb0
> > [  156.786314]  __thermal_cdev_update+0xc0/0x200 [  156.790818]
> > thermal_cdev_update+0x4e/0x70 [  156.795064]
> > step_wise_throttle+0x426/0x6f0 [  156.799408]  ?
> > devm_thermal_add_hwmon_sysfs+0x80/0x80
> > [  156.804628]  ? thermal_zone_get_temp+0xa0/0xa0 [  156.809185]  ?
> > netlink_broadcast+0xa/0x10 [  156.813353]  ?
> > thermal_genl_sampling_temp+0x16a/0x200
> > [  156.818563]  ? thermal_genl_cmd_doit+0x290/0x290 [  156.823332]  ?
> > rcu_read_lock_bh_held+0xb0/0xb0 [  156.827900]
> > thermal_zone_device_update+0x39e/0x850
> > [  156.832954]  ? trace_event_raw_event_thermal_zone_trip+0x1f0/0x1f0
> > [  156.839349]  ? wait_for_completion+0x190/0x190 [  156.843944]  ?
> > devm_thermal_of_cooling_device_register+0xa0/0xa0
> > [  156.850148]  ? snprintf+0x91/0xc0
> > [  156.853594]  ? vsprintf+0x10/0x10
> > [  156.857039]  thermal_zone_device_set_mode+0x81/0xf0
> > [  156.862079]  mlxsw_thermal_modules_init+0x48f/0x590 [mlxsw_core] [
> > 156.868392]  ? mlxsw_thermal_set_cur_state+0x590/0x590 [mlxsw_core] [
> > 156.874954]  ? netlink_broadcast+0xa/0x10 [  156.879090]  ?
> > thermal_genl_send_event+0x117/0x1a0
> > [  156.884039]  ? thermal_notify_tz_create+0x7d/0xb0
> > [  156.888903]  ? thermal_genl_sampling_temp+0x200/0x200
> > [  156.894147]  ? do_init_timer+0x6c/0x80 [  156.898009]  ?
> > thermal_zone_device_register+0x8af/0x9a0
> > [  156.903438]  ? __thermal_cooling_device_register+0x4f2/0x500
> > [  156.909272]  mlxsw_thermal_init+0x763/0x880 [mlxsw_core] [
> > 156.914882]  ? mlxsw_thermal_gearboxes_init.isra.8+0x460/0x460
> > [mlxsw_core] [  156.922214]
> > __mlxsw_core_bus_device_register+0xa0c/0xca0 [mlxsw_core] [
> > 156.929093]  ? dev_printk_emit+0x90/0xb6 [  156.933136]  ?
> > dev_vprintk_emit+0x208/0x208 [  156.937462]  ?
> > mlxsw_devlink_info_get+0x490/0x490 [mlxsw_core] [  156.943580]  ?
> > do_raw_spin_lock+0x1d0/0x1d0 [  156.947906]  ?
> > lockdep_hardirqs_on_prepare+0xe/0x230
> > [  156.953041]  ? __dev_printk+0x9e/0xd6 [  156.956818]  ?
> > _dev_info+0xc8/0xf6 [  156.960339]  ? _dev_notice+0x84/0xf6 [
> > 156.964047]  ? mark_wakeup_next_waiter+0x1b0/0x1b0
> > [  156.968988]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
> > [  156.975492]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c] [
> > 156.981259]  ? mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c] [
> > 156.987701]  ? lock_release+0x50/0x6c0 [  156.991555]  ?
> > devres_open_group+0x13d/0x180 [  156.995936]  ?
> > lock_downgrade+0x3a0/0x3a0 [  157.000108]  ?
> > lock_contended+0x710/0x710 [  157.004229]  ?
> > devres_open_group+0x59/0x180 [  157.008539]  ? devres_log+0x11a/0x180
> > [  157.012309]  ? trace_hardirqs_on+0x1c/0x110 [  157.016630]  ?
> > preempt_count_sub+0xf/0xb0 [  157.020781]  ?
> > mlxsw_i2c_wait_go_bit.isra.3+0x1c0/0x1c0 [mlxsw_i2c] [  157.027255]  ?
> > i2c_device_probe+0x30e/0x370 [  157.031573]
> > i2c_device_probe+0x30e/0x370 [  157.035746]  really_probe+0x149/0x3c0
> > [  157.039535]  ? driver_allows_async_probing+0x80/0x80
> > [  157.044629]  __driver_probe_device+0xc3/0x130 [  157.049127]
> > driver_probe_device+0x45/0x100 [  157.053486]
> > __device_attach_driver+0xd6/0x100 [  157.058069]
> > bus_for_each_drv+0xe7/0x150 [  157.062122]  ?
> > bus_rescan_devices+0x10/0x10 [  157.066444]  ?
> > lockdep_hardirqs_on_prepare+0xe/0x230
> > [  157.071587]  ? trace_hardirqs_on+0x1c/0x110 [  157.075891]  ?
> > preempt_count_sub+0xf/0xb0 [  157.080050]  ?
> > _raw_spin_unlock_irqrestore+0x36/0x50
> > [  157.085193]  __device_attach+0x185/0x210 [  157.089231]  ?
> > device_bind_driver+0x70/0x70 [  157.093569]  ?
> > kobject_uevent_env+0x287/0x940 [  157.098091]
> > bus_probe_device+0xf9/0x120 [  157.102159]  device_add+0x623/0xeb0 [
> > 157.105746]  ? static_obj+0x32/0x80 [  157.109348]  ?
> > lockdep_init_map_type+0xd9/0x360 [  157.114062]  ?
> > __fw_devlink_link_to_suppliers+0x270/0x270
> > [  157.119619]  ? __raw_spin_lock_init+0x71/0x80 [  157.124134]
> > i2c_new_client_device+0x277/0x3a0 [  157.128715]
> > new_device_store+0x13c/0x270 [  157.132847]  ? copyin+0x6b/0x80 [
> > 157.136090]  ? i2c_new_ancillary_device+0x20/0x20
> > [  157.140946]  ? lock_acquire+0xc0/0x3f0 [  157.144830]  ?
> > lock_release+0x6c0/0x6c0 [  157.148833]  ? sysfs_file_ops+0x6b/0x90 [
> > 157.152808]  ? sysfs_file_ops+0x90/0x90 [  157.156774]
> > kernfs_fop_write_iter+0x1af/0x250 [  157.161346]
> > new_sync_write+0x25a/0x380 [  157.165327]  ?
> new_sync_read+0x370/0x370
> > [  157.169384]  ? rcu_read_lock_bh_held+0xb0/0xb0 [  157.173973]  ?
> > rcu_read_lock_held_common+0x12/0x50
> > [  157.178911]  ? irq_migrate_all_off_this_cpu+0xf0/0x300
> > [  157.184208]  ? lock_release+0x6c0/0x6c0 [  157.188149]  ?
> > rcu_read_lock_sched_held+0x5a/0xd0
> > [  157.192995]  ? rcu_read_lock_held+0xb0/0xb0 [  157.197310]
> > vfs_write+0x33d/0x530 [  157.200839]  ksys_write+0xbb/0x150 [
> > 157.204359]  ? __ia32_sys_read+0x40/0x40 [  157.208410]  ?
> > set_load_weight+0xd1/0x110 [  157.212557]  do_syscall_64+0x3a/0x80 [
> > 157.216256]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  157.221457] RIP: 0033:0x7fa80bd47970 [  157.225131] Code: 73 01 c3
> > 48 8b 0d 28 d5 2b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 0f 1f 44 00 00
> > 83 3d 99 2d 2c 00 00 75 10 b8 01 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73
> > 31 c3 48 83 ec 08 e8 7e 9b 01 00 48 89 04 24 [  157.244407] RSP:
> > 002b:00007fff14b2c648 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [
> > 157.252196] RAX: ffffffffffffffda RBX: 0000000000000013 RCX:
> > 00007fa80bd47970 [  157.259528] RDX: 0000000000000013 RSI:
> > 0000000000ede408 RDI: 0000000000000001 [  157.266868] RBP:
> > 0000000000ede408 R08: 00007fa80c007760 R09: 00007fa80c653700 [
> > 157.274219] R10: 0000000000000073 R11: 0000000000000246 R12:
> 0000000000000013 [  157.281564] R13: 0000000000000001 R14:
> 00007fa80c006600 R15: 0000000000000013 [  157.288933] [  157.290484]
> Allocated by task 2300:
> > [  157.294054]  kasan_save_stack+0x19/0x40 [  157.298024]
> > __kasan_kmalloc+0x7f/0xa0 [  157.301914]  __kmalloc+0x18f/0x2c0 [
> > 157.305417]  thermal_cooling_device_setup_sysfs+0xf9/0x1a0
> > [  157.311075]  __thermal_cooling_device_register+0x1b5/0x500
> > [  157.316734]  mlxsw_thermal_init+0x7e4/0x880 [mlxsw_core] [
> > 157.322325]  __mlxsw_core_bus_device_register+0xa0c/0xca0
> [mlxsw_core]
> > [  157.329154]  mlxsw_core_bus_device_register+0x3e/0x60 [mlxsw_core]
> > [  157.335656]  mlxsw_i2c_probe.cold.8+0x159/0x283 [mlxsw_i2c] [
> > 157.341373]  i2c_device_probe+0x30e/0x370 [  157.345524]
> > really_probe+0x149/0x3c0 [  157.349330]
> > __driver_probe_device+0xc3/0x130 [  157.353815]
> > driver_probe_device+0x45/0x100 [  157.358117]
> > __device_attach_driver+0xd6/0x100 [  157.362712]
> > bus_for_each_drv+0xe7/0x150 [  157.366768]
> > __device_attach+0x185/0x210 [  157.370823]
> > bus_probe_device+0xf9/0x120 [  157.374897]  device_add+0x623/0xeb0 [
> > 157.378512]  i2c_new_client_device+0x277/0x3a0 [  157.383094]
> > new_device_store+0x13c/0x270 [  157.387222]
> > kernfs_fop_write_iter+0x1af/0x250 [  157.391797]
> > new_sync_write+0x25a/0x380 [  157.395765]  vfs_write+0x33d/0x530 [
> > 157.399267]  ksys_write+0xbb/0x150 [  157.402795]
> > do_syscall_64+0x3a/0x80 [  157.406496]
> > entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [  157.411686]
> > [  157.413236] The buggy address belongs to the object at
> > ffff88811f63e400 [  157.413236]  which belongs to the cache kmalloc-1k
> > of size 1024 [  157.426135] The buggy address is located 64 bytes to
> > the right of [  157.426135]  1024-byte region [ffff88811f63e400,
> > ffff88811f63e800) [  157.438710] The buggy address belongs to the page:
> > [  157.443640] page:ffffea00047d8e00 refcount:1 mapcount:0
> > mapping:0000000000000000 index:0xffff88811f63c000 pfn:0x11f638 [
> > 157.454579] head:ffffea00047d8e00 order:3 compound_mapcount:0
> > compound_pincount:0 [  157.462281] flags:
> > 0x200000000010200(slab|head|node=0|zone=2)
> > [  157.468220] raw: 0200000000010200 ffffea000863a808 ffffea00045ab208
> > ffff888100043bc0 [  157.476205] raw: ffff88811f63c000 00000000000a0002
> > 00000001ffffffff 0000000000000000 [  157.484185] page dumped because:
> > kasan: bad access detected [  157.489895] [  157.491445] Memory state
> > around the buggy address:
> > [  157.496373]  ffff88811f63e700: fc fc fc fc fc fc fc fc fc fc fc fc
> > fc fc fc fc [  157.503809]  ffff88811f63e780: fc fc fc fc fc fc fc fc
> > fc fc fc fc fc fc fc fc [  157.511235] >ffff88811f63e800: fc fc fc fc fc fc fc fc fc fc
> fc fc fc fc fc fc
> > [  157.518641]                                            ^
> > [  157.524113]  ffff88811f63e880: fc fc fc fc fc fc fc fc fc fc fc fc
> > fc fc fc fc [  157.531571]  ffff88811f63e900: fc fc fc fc fc fc fc fc
> > fc fc fc fc fc fc fc fc [  157.538980]
> >
> ==========================================================
> ========
> >
> > Fixes: 8ea229511e06 ("thermal: Add cooling device's statistics in
> > sysfs")
> > Signed-off-by: Vadim Pasternak <vadimp@nvidia.com>
> > ---
> >  drivers/thermal/thermal_sysfs.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/drivers/thermal/thermal_sysfs.c
> > b/drivers/thermal/thermal_sysfs.c index 1c4aac8464a7..80b38b180140
> > 100644
> > --- a/drivers/thermal/thermal_sysfs.c
> > +++ b/drivers/thermal/thermal_sysfs.c
> > @@ -673,10 +673,14 @@ void
> thermal_cooling_device_stats_update(struct thermal_cooling_device *cdev,
> >  					 unsigned long new_state)
> >  {
> >  	struct cooling_dev_stats *stats = cdev->stats;
> > +	unsigned long max_state;
> >
> >  	if (!stats)
> >  		return;
> >
> > +	if (cdev->ops->get_max_state(cdev, &max_state) || new_state >
> max_state)
> > +		return;
> > +
> >  	spin_lock(&stats->lock);
> >
> >  	if (stats->state == new_state)
> >
> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-
> blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 11:22   ` Vadim Pasternak
@ 2021-09-27 12:31     ` Daniel Lezcano
  2021-09-27 13:29       ` Vadim Pasternak
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lezcano @ 2021-09-27 12:31 UTC (permalink / raw)
  To: Vadim Pasternak, rui.zhang; +Cc: linux-pm, Ido Schimmel, Rafael J. Wysocki




On 27/09/2021 13:22, Vadim Pasternak wrote:
> Hi Daniel,
> 
> Thank you for quick reply.
> 
>> -----Original Message-----
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Sent: Monday, September 27, 2021 1:42 PM
>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
>> update for configuration operation
>>
>>
>> Hi Vadim,
>>
>>
>> On 27/09/2021 10:24, Vadim Pasternak wrote:
>>> The thermal subsystem maintains a transition table between states that
>>> is allocated according to the maximum state supported by the cooling
>>> device.
>>>
>>> When the table needs to be updated, the thermal subsystem does not
>>> validate that the new state does not exceed the maximum state, leading
>>> to out-of-bounds memory accesses [1].
>>
>> Actually, thermal_cooling_device_stats_update() is called if the
>> set_cur_state is successful.
>>
>> With a state greater than the max state, the set_cur_state should fail and
>> thermal_cooling_device_stats_update() is not called.
>>
>> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
> 
> "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state' for
> configuration purpose to limit minimum fan speed. 
> Fan speed minimum is enforced by setting 'cur_state' with value
> exceeding actual fan speed maximum.

Yes, and that is the problem because the driver is doing weird things
with the cooling device state resulting in an abuse of the sysfs API and
conflicting with the thermal internals.


> This feature provides ability to limit fan speed according to some
> system wise considerations, like absence of some replaceable units 
> or high system ambient temperature, or some other factors which
> indirectly impacts system airflow.

Is that a static thermal profile depending on the platform set by
userspace or something which can be changed dynamically at runtime via
eg. a daemon ?

> For example, if cooling devices operates at cooling levels from 1 to 10
> (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
> speed can be limited by setting 'cur_state' attribute through 'sysfs'
> to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
> Following this example if value is set to 14 (40%) cooling levels vector
> will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting device speed
> cooling states respectively in 40, 40, 40, 40, 40, 50, 60. 70, 80, 90,
> 100 percent. And it limits cooling device to operate only at 40% speed
> and above.
>
> Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
> attribute for this feature, but it was not done.
> 
> We have another driver required this feature and one new we are
> developing now, which require fan minim speed limit as well.

The use case is valid but I think the approach is wrong. Probably the
simplest thing to do is to set a low trip point with a minimal fan speed.


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 12:31     ` Daniel Lezcano
@ 2021-09-27 13:29       ` Vadim Pasternak
  2021-09-27 13:55         ` Daniel Lezcano
  0 siblings, 1 reply; 9+ messages in thread
From: Vadim Pasternak @ 2021-09-27 13:29 UTC (permalink / raw)
  To: Daniel Lezcano, rui.zhang; +Cc: linux-pm, Ido Schimmel, Rafael J. Wysocki



> -----Original Message-----
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Sent: Monday, September 27, 2021 3:32 PM
> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>; Rafael J.
> Wysocki <rjw@rjwysocki.net>
> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
> update for configuration operation
> 
> 
> 
> 
> On 27/09/2021 13:22, Vadim Pasternak wrote:
> > Hi Daniel,
> >
> > Thank you for quick reply.
> >
> >> -----Original Message-----
> >> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Sent: Monday, September 27, 2021 1:42 PM
> >> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
> >> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
> >> statistics update for configuration operation
> >>
> >>
> >> Hi Vadim,
> >>
> >>
> >> On 27/09/2021 10:24, Vadim Pasternak wrote:
> >>> The thermal subsystem maintains a transition table between states
> >>> that is allocated according to the maximum state supported by the
> >>> cooling device.
> >>>
> >>> When the table needs to be updated, the thermal subsystem does not
> >>> validate that the new state does not exceed the maximum state,
> >>> leading to out-of-bounds memory accesses [1].
> >>
> >> Actually, thermal_cooling_device_stats_update() is called if the
> >> set_cur_state is successful.
> >>
> >> With a state greater than the max state, the set_cur_state should
> >> fail and
> >> thermal_cooling_device_stats_update() is not called.
> >>
> >> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
> >
> > "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state' for
> > configuration purpose to limit minimum fan speed.
> > Fan speed minimum is enforced by setting 'cur_state' with value
> > exceeding actual fan speed maximum.
> 
> Yes, and that is the problem because the driver is doing weird things with the
> cooling device state resulting in an abuse of the sysfs API and conflicting with
> the thermal internals.
> 
> 
> > This feature provides ability to limit fan speed according to some
> > system wise considerations, like absence of some replaceable units or
> > high system ambient temperature, or some other factors which
> > indirectly impacts system airflow.
> 
> Is that a static thermal profile depending on the platform set by userspace or
> something which can be changed dynamically at runtime via eg. a daemon ?

Yes, this is some profiles/rules, which are system specific and according to these
rules userspace can limit fan speed. Like, for example:
- if one of power supplies is removed, system fan should be enforced to full
  speed, because it makes a hole in a box, and it has hard impact on airflow.
- If port side ambient temperature reaches some threshold X1, fan speed should
  be limited by Y1%, X2 - Y2%, etcetera.
- if temperature fault is detected for any optical transceivers - some limit is
  required. 

> 
> > For example, if cooling devices operates at cooling levels from 1 to
> > 10
> > (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
> > speed can be limited by setting 'cur_state' attribute through 'sysfs'
> > to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
> > Following this example if value is set to 14 (40%) cooling levels
> > vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting
> > device speed cooling states respectively in 40, 40, 40, 40, 40, 50,
> > 60. 70, 80, 90,
> > 100 percent. And it limits cooling device to operate only at 40% speed
> > and above.
> >
> > Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
> > attribute for this feature, but it was not done.
> >
> > We have another driver required this feature and one new we are
> > developing now, which require fan minim speed limit as well.
> 
> The use case is valid but I think the approach is wrong. Probably the simplest
> thing to do is to set a low trip point with a minimal fan speed.

For "trip_point_0_temp" there is the below definition:
	{	/* In range - 0-40% PWM */
		.type		= THERMAL_TRIP_ACTIVE,
		.temp		= MLXSW_THERMAL_ASIC_TEMP_NORM,
		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
		.min_state	= 0,
		.max_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10, (40%)
	},
For "trip_point_1_temp":
	{
		/* In range - 40-100% PWM */
		.type		= THERMAL_TRIP_ACTIVE,
		.temp		= MLXSW_THERMAL_ASIC_TEMP_HIGH,
		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
		.min_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10, (100%)
		.max_state	= MLXSW_THERMAL_MAX_STATE,
	},

To limit cooling device by f.e 70%, I should change dynamically 'min_state' and 'max_state'
for both trips to (70%, 70%) and (70%, 100%). I am not sure I can do it?

And we have many customers, using this user space interface, it would be not so good to
change it.

I understand it is possible to handle this issue inside driver's set_cur_stat() callback by
returning positive value for configuration request.
But maybe this feature could you useful for other developers and it could be some common
interface to support it? 

> 
> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-
> blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 13:29       ` Vadim Pasternak
@ 2021-09-27 13:55         ` Daniel Lezcano
  2021-09-27 17:52           ` Vadim Pasternak
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lezcano @ 2021-09-27 13:55 UTC (permalink / raw)
  To: Vadim Pasternak, rui.zhang; +Cc: linux-pm, Ido Schimmel, Rafael J. Wysocki

On 27/09/2021 15:29, Vadim Pasternak wrote:
> 
> 
>> -----Original Message-----
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Sent: Monday, September 27, 2021 3:32 PM
>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>; Rafael J.
>> Wysocki <rjw@rjwysocki.net>
>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
>> update for configuration operation
>>
>>
>>
>>
>> On 27/09/2021 13:22, Vadim Pasternak wrote:
>>> Hi Daniel,
>>>
>>> Thank you for quick reply.
>>>
>>>> -----Original Message-----
>>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>>>> Sent: Monday, September 27, 2021 1:42 PM
>>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>>>> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
>>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
>>>> statistics update for configuration operation
>>>>
>>>>
>>>> Hi Vadim,
>>>>
>>>>
>>>> On 27/09/2021 10:24, Vadim Pasternak wrote:
>>>>> The thermal subsystem maintains a transition table between states
>>>>> that is allocated according to the maximum state supported by the
>>>>> cooling device.
>>>>>
>>>>> When the table needs to be updated, the thermal subsystem does not
>>>>> validate that the new state does not exceed the maximum state,
>>>>> leading to out-of-bounds memory accesses [1].
>>>>
>>>> Actually, thermal_cooling_device_stats_update() is called if the
>>>> set_cur_state is successful.
>>>>
>>>> With a state greater than the max state, the set_cur_state should
>>>> fail and
>>>> thermal_cooling_device_stats_update() is not called.
>>>>
>>>> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
>>>
>>> "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state' for
>>> configuration purpose to limit minimum fan speed.
>>> Fan speed minimum is enforced by setting 'cur_state' with value
>>> exceeding actual fan speed maximum.
>>
>> Yes, and that is the problem because the driver is doing weird things with the
>> cooling device state resulting in an abuse of the sysfs API and conflicting with
>> the thermal internals.
>>
>>
>>> This feature provides ability to limit fan speed according to some
>>> system wise considerations, like absence of some replaceable units or
>>> high system ambient temperature, or some other factors which
>>> indirectly impacts system airflow.
>>
>> Is that a static thermal profile depending on the platform set by userspace or
>> something which can be changed dynamically at runtime via eg. a daemon ?
> 
> Yes, this is some profiles/rules, which are system specific and according to these
> rules userspace can limit fan speed. Like, for example:
> - if one of power supplies is removed, system fan should be enforced to full
>   speed, because it makes a hole in a box, and it has hard impact on airflow.
> - If port side ambient temperature reaches some threshold X1, fan speed should
>   be limited by Y1%, X2 - Y2%, etcetera.
> - if temperature fault is detected for any optical transceivers - some limit is
>   required. 

I see, thanks for the information.

>>> For example, if cooling devices operates at cooling levels from 1 to
>>> 10
>>> (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
>>> speed can be limited by setting 'cur_state' attribute through 'sysfs'
>>> to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
>>> Following this example if value is set to 14 (40%) cooling levels
>>> vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting
>>> device speed cooling states respectively in 40, 40, 40, 40, 40, 50,
>>> 60. 70, 80, 90,
>>> 100 percent. And it limits cooling device to operate only at 40% speed
>>> and above.
>>>
>>> Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
>>> attribute for this feature, but it was not done.
>>>
>>> We have another driver required this feature and one new we are
>>> developing now, which require fan minim speed limit as well.
>>
>> The use case is valid but I think the approach is wrong. Probably the simplest
>> thing to do is to set a low trip point with a minimal fan speed.
> 
> For "trip_point_0_temp" there is the below definition:
> 	{	/* In range - 0-40% PWM */
> 		.type		= THERMAL_TRIP_ACTIVE,
> 		.temp		= MLXSW_THERMAL_ASIC_TEMP_NORM,
> 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
> 		.min_state	= 0,
> 		.max_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10, (40%)
> 	},
> For "trip_point_1_temp":
> 	{
> 		/* In range - 40-100% PWM */
> 		.type		= THERMAL_TRIP_ACTIVE,
> 		.temp		= MLXSW_THERMAL_ASIC_TEMP_HIGH,
> 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
> 		.min_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10, (100%)
> 		.max_state	= MLXSW_THERMAL_MAX_STATE,
> 	},
> 
> To limit cooling device by f.e 70%, I should change dynamically 'min_state' and 'max_state'
> for both trips to (70%, 70%) and (70%, 100%). I am not sure I can do it?
> 
> And we have many customers, using this user space interface, it would be not so good to
> change it.

The issue is the driver is wrong here. The change you referred in the
past should have not been merged. I note there is no thermal maintainer
blessing in the tags.

> I understand it is possible to handle this issue inside driver's set_cur_stat() callback by
> returning positive value for configuration request.
> But maybe this feature could you useful for other developers and it could be some common
> interface to support it? 

I suggest to revert a421ce088ac8 and switch to hwmon to set the fan
speed. At the first glance, it is already supported by the mlx driver, no ?


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* RE: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 13:55         ` Daniel Lezcano
@ 2021-09-27 17:52           ` Vadim Pasternak
  2021-09-27 21:12             ` Daniel Lezcano
  0 siblings, 1 reply; 9+ messages in thread
From: Vadim Pasternak @ 2021-09-27 17:52 UTC (permalink / raw)
  To: Daniel Lezcano, rui.zhang; +Cc: linux-pm, Ido Schimmel, Rafael J. Wysocki



> -----Original Message-----
> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> Sent: Monday, September 27, 2021 4:56 PM
> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>; Rafael J.
> Wysocki <rjw@rjwysocki.net>
> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
> update for configuration operation
> 
> On 27/09/2021 15:29, Vadim Pasternak wrote:
> >
> >
> >> -----Original Message-----
> >> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Sent: Monday, September 27, 2021 3:32 PM
> >> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>;
> Rafael J.
> >> Wysocki <rjw@rjwysocki.net>
> >> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
> >> statistics update for configuration operation
> >>
> >>
> >>
> >>
> >> On 27/09/2021 13:22, Vadim Pasternak wrote:
> >>> Hi Daniel,
> >>>
> >>> Thank you for quick reply.
> >>>
> >>>> -----Original Message-----
> >>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>>> Sent: Monday, September 27, 2021 1:42 PM
> >>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >>>> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
> >>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
> >>>> statistics update for configuration operation
> >>>>
> >>>>
> >>>> Hi Vadim,
> >>>>
> >>>>
> >>>> On 27/09/2021 10:24, Vadim Pasternak wrote:
> >>>>> The thermal subsystem maintains a transition table between states
> >>>>> that is allocated according to the maximum state supported by the
> >>>>> cooling device.
> >>>>>
> >>>>> When the table needs to be updated, the thermal subsystem does
> not
> >>>>> validate that the new state does not exceed the maximum state,
> >>>>> leading to out-of-bounds memory accesses [1].
> >>>>
> >>>> Actually, thermal_cooling_device_stats_update() is called if the
> >>>> set_cur_state is successful.
> >>>>
> >>>> With a state greater than the max state, the set_cur_state should
> >>>> fail and
> >>>> thermal_cooling_device_stats_update() is not called.
> >>>>
> >>>> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
> >>>
> >>> "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state'
> >>> for configuration purpose to limit minimum fan speed.
> >>> Fan speed minimum is enforced by setting 'cur_state' with value
> >>> exceeding actual fan speed maximum.
> >>
> >> Yes, and that is the problem because the driver is doing weird things
> >> with the cooling device state resulting in an abuse of the sysfs API
> >> and conflicting with the thermal internals.
> >>
> >>
> >>> This feature provides ability to limit fan speed according to some
> >>> system wise considerations, like absence of some replaceable units
> >>> or high system ambient temperature, or some other factors which
> >>> indirectly impacts system airflow.
> >>
> >> Is that a static thermal profile depending on the platform set by
> >> userspace or something which can be changed dynamically at runtime via
> eg. a daemon ?
> >
> > Yes, this is some profiles/rules, which are system specific and
> > according to these rules userspace can limit fan speed. Like, for example:
> > - if one of power supplies is removed, system fan should be enforced to
> full
> >   speed, because it makes a hole in a box, and it has hard impact on airflow.
> > - If port side ambient temperature reaches some threshold X1, fan speed
> should
> >   be limited by Y1%, X2 - Y2%, etcetera.
> > - if temperature fault is detected for any optical transceivers - some limit is
> >   required.
> 
> I see, thanks for the information.
> 
> >>> For example, if cooling devices operates at cooling levels from 1 to
> >>> 10
> >>> (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
> >>> speed can be limited by setting 'cur_state' attribute through 'sysfs'
> >>> to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
> >>> Following this example if value is set to 14 (40%) cooling levels
> >>> vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting
> >>> device speed cooling states respectively in 40, 40, 40, 40, 40, 50,
> >>> 60. 70, 80, 90,
> >>> 100 percent. And it limits cooling device to operate only at 40%
> >>> speed and above.
> >>>
> >>> Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
> >>> attribute for this feature, but it was not done.
> >>>
> >>> We have another driver required this feature and one new we are
> >>> developing now, which require fan minim speed limit as well.
> >>
> >> The use case is valid but I think the approach is wrong. Probably the
> >> simplest thing to do is to set a low trip point with a minimal fan speed.
> >
> > For "trip_point_0_temp" there is the below definition:
> > 	{	/* In range - 0-40% PWM */
> > 		.type		= THERMAL_TRIP_ACTIVE,
> > 		.temp		= MLXSW_THERMAL_ASIC_TEMP_NORM,
> > 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
> > 		.min_state	= 0,
> > 		.max_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10,
> (40%)
> > 	},
> > For "trip_point_1_temp":
> > 	{
> > 		/* In range - 40-100% PWM */
> > 		.type		= THERMAL_TRIP_ACTIVE,
> > 		.temp		= MLXSW_THERMAL_ASIC_TEMP_HIGH,
> > 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
> > 		.min_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10,
> (100%)
> > 		.max_state	= MLXSW_THERMAL_MAX_STATE,
> > 	},
> >
> > To limit cooling device by f.e 70%, I should change dynamically 'min_state'
> and 'max_state'
> > for both trips to (70%, 70%) and (70%, 100%). I am not sure I can do it?
> >
> > And we have many customers, using this user space interface, it would
> > be not so good to change it.
> 
> The issue is the driver is wrong here. The change you referred in the past
> should have not been merged. I note there is no thermal maintainer blessing
> in the tags.
> 
> > I understand it is possible to handle this issue inside driver's
> > set_cur_stat() callback by returning positive value for configuration
> request.
> > But maybe this feature could you useful for other developers and it
> > could be some common interface to support it?
> 
> I suggest to revert a421ce088ac8 and switch to hwmon to set the fan speed.
> At the first glance, it is already supported by the mlx driver, no ?
> 

Yes, hwmon is supported.
But setting fan trough hwmon means we'll have two owners controlling same device.
If speed is set through hwmon for example to 100%, thermal could decide to decrease
it, since it does not know what the reason was for setting full speed.
And it could make a completion between hwmon and thermal for cooling control.

Maybe adding new set_min_state()/get_min_state() callback could work?
In such case user space can limit speed through "min_state" attrbbute.

Maybe some other approach, but within thermal subsystem?





> 
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-
> blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 17:52           ` Vadim Pasternak
@ 2021-09-27 21:12             ` Daniel Lezcano
  2021-09-28 11:35               ` Rafael J. Wysocki
  0 siblings, 1 reply; 9+ messages in thread
From: Daniel Lezcano @ 2021-09-27 21:12 UTC (permalink / raw)
  To: Vadim Pasternak, rui.zhang; +Cc: linux-pm, Ido Schimmel, Rafael J. Wysocki

On 27/09/2021 19:52, Vadim Pasternak wrote:
> 
> 
>> -----Original Message-----
>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>> Sent: Monday, September 27, 2021 4:56 PM
>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>; Rafael J.
>> Wysocki <rjw@rjwysocki.net>
>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
>> update for configuration operation
>>
>> On 27/09/2021 15:29, Vadim Pasternak wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>>>> Sent: Monday, September 27, 2021 3:32 PM
>>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>>>> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>;
>> Rafael J.
>>>> Wysocki <rjw@rjwysocki.net>
>>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
>>>> statistics update for configuration operation
>>>>
>>>>
>>>>
>>>>
>>>> On 27/09/2021 13:22, Vadim Pasternak wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> Thank you for quick reply.
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
>>>>>> Sent: Monday, September 27, 2021 1:42 PM
>>>>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
>>>>>> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
>>>>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
>>>>>> statistics update for configuration operation
>>>>>>
>>>>>>
>>>>>> Hi Vadim,
>>>>>>
>>>>>>
>>>>>> On 27/09/2021 10:24, Vadim Pasternak wrote:
>>>>>>> The thermal subsystem maintains a transition table between states
>>>>>>> that is allocated according to the maximum state supported by the
>>>>>>> cooling device.
>>>>>>>
>>>>>>> When the table needs to be updated, the thermal subsystem does
>> not
>>>>>>> validate that the new state does not exceed the maximum state,
>>>>>>> leading to out-of-bounds memory accesses [1].
>>>>>>
>>>>>> Actually, thermal_cooling_device_stats_update() is called if the
>>>>>> set_cur_state is successful.
>>>>>>
>>>>>> With a state greater than the max state, the set_cur_state should
>>>>>> fail and
>>>>>> thermal_cooling_device_stats_update() is not called.
>>>>>>
>>>>>> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
>>>>>
>>>>> "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state'
>>>>> for configuration purpose to limit minimum fan speed.
>>>>> Fan speed minimum is enforced by setting 'cur_state' with value
>>>>> exceeding actual fan speed maximum.
>>>>
>>>> Yes, and that is the problem because the driver is doing weird things
>>>> with the cooling device state resulting in an abuse of the sysfs API
>>>> and conflicting with the thermal internals.
>>>>
>>>>
>>>>> This feature provides ability to limit fan speed according to some
>>>>> system wise considerations, like absence of some replaceable units
>>>>> or high system ambient temperature, or some other factors which
>>>>> indirectly impacts system airflow.
>>>>
>>>> Is that a static thermal profile depending on the platform set by
>>>> userspace or something which can be changed dynamically at runtime via
>> eg. a daemon ?
>>>
>>> Yes, this is some profiles/rules, which are system specific and
>>> according to these rules userspace can limit fan speed. Like, for example:
>>> - if one of power supplies is removed, system fan should be enforced to
>> full
>>>   speed, because it makes a hole in a box, and it has hard impact on airflow.
>>> - If port side ambient temperature reaches some threshold X1, fan speed
>> should
>>>   be limited by Y1%, X2 - Y2%, etcetera.
>>> - if temperature fault is detected for any optical transceivers - some limit is
>>>   required.
>>
>> I see, thanks for the information.
>>
>>>>> For example, if cooling devices operates at cooling levels from 1 to
>>>>> 10
>>>>> (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
>>>>> speed can be limited by setting 'cur_state' attribute through 'sysfs'
>>>>> to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
>>>>> Following this example if value is set to 14 (40%) cooling levels
>>>>> vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting
>>>>> device speed cooling states respectively in 40, 40, 40, 40, 40, 50,
>>>>> 60. 70, 80, 90,
>>>>> 100 percent. And it limits cooling device to operate only at 40%
>>>>> speed and above.
>>>>>
>>>>> Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
>>>>> attribute for this feature, but it was not done.
>>>>>
>>>>> We have another driver required this feature and one new we are
>>>>> developing now, which require fan minim speed limit as well.
>>>>
>>>> The use case is valid but I think the approach is wrong. Probably the
>>>> simplest thing to do is to set a low trip point with a minimal fan speed.
>>>
>>> For "trip_point_0_temp" there is the below definition:
>>> 	{	/* In range - 0-40% PWM */
>>> 		.type		= THERMAL_TRIP_ACTIVE,
>>> 		.temp		= MLXSW_THERMAL_ASIC_TEMP_NORM,
>>> 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
>>> 		.min_state	= 0,
>>> 		.max_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10,
>> (40%)
>>> 	},
>>> For "trip_point_1_temp":
>>> 	{
>>> 		/* In range - 40-100% PWM */
>>> 		.type		= THERMAL_TRIP_ACTIVE,
>>> 		.temp		= MLXSW_THERMAL_ASIC_TEMP_HIGH,
>>> 		.hyst		= MLXSW_THERMAL_HYSTERESIS_TEMP,
>>> 		.min_state	= (4 * MLXSW_THERMAL_MAX_STATE) / 10,
>> (100%)
>>> 		.max_state	= MLXSW_THERMAL_MAX_STATE,
>>> 	},
>>>
>>> To limit cooling device by f.e 70%, I should change dynamically 'min_state'
>> and 'max_state'
>>> for both trips to (70%, 70%) and (70%, 100%). I am not sure I can do it?
>>>
>>> And we have many customers, using this user space interface, it would
>>> be not so good to change it.
>>
>> The issue is the driver is wrong here. The change you referred in the past
>> should have not been merged. I note there is no thermal maintainer blessing
>> in the tags.
>>
>>> I understand it is possible to handle this issue inside driver's
>>> set_cur_stat() callback by returning positive value for configuration
>> request.
>>> But maybe this feature could you useful for other developers and it
>>> could be some common interface to support it?
>>
>> I suggest to revert a421ce088ac8 and switch to hwmon to set the fan speed.
>> At the first glance, it is already supported by the mlx driver, no ?
>>
> 
> Yes, hwmon is supported.
> But setting fan trough hwmon means we'll have two owners controlling same device.

Whatever the solution, there are several components acting on the device
at the same time (userspace and in-kernel) :/


> If speed is set through hwmon for example to 100%, thermal could decide to decrease
> it, since it does not know what the reason was for setting full speed.
> And it could make a completion between hwmon and thermal for cooling control.
> 
> Maybe adding new set_min_state()/get_min_state() callback could work?
> In such case user space can limit speed through "min_state" attrbbute.
> 
> Maybe some other approach, but within thermal subsystem?

I'm very reluctant to let the userspace to deal with the cooling device
because that is prone to abuse. And this driver is another example :(

Actually, the thermal framework should be dedicated to protect Tj,
nothing else.

However, as the hwmon is already supported, what is possible is to set
the min state in the driver when it is set via hwmon and use this min
state from the cooling device to prevent going below in set_cur_state.

So instead of dealing with out-of-boundaries values with the state, just
ignore requests which are below the min_state set by the hwmon in
set_cur_state.

There is no consistent mapping between what is a cooling device state
and hwmon, so it is hard to do generic code ATM.

Dealing with the state in the driver, makes the code self-encapsulate
and consistent.

Please revert commit a421ce088ac8 and check the code in mlxreg-fan.c
which sounds the exact copy of what provided a421ce088ac8.


>> --
>> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>>
>> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
>> <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-
>> blog/> Blog


-- 
<http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs

Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation
  2021-09-27 21:12             ` Daniel Lezcano
@ 2021-09-28 11:35               ` Rafael J. Wysocki
  0 siblings, 0 replies; 9+ messages in thread
From: Rafael J. Wysocki @ 2021-09-28 11:35 UTC (permalink / raw)
  To: Daniel Lezcano
  Cc: Vadim Pasternak, rui.zhang, linux-pm, Ido Schimmel, Rafael J. Wysocki

On Mon, Sep 27, 2021 at 11:12 PM Daniel Lezcano
<daniel.lezcano@linaro.org> wrote:
>
> On 27/09/2021 19:52, Vadim Pasternak wrote:
> >
> >
> >> -----Original Message-----
> >> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >> Sent: Monday, September 27, 2021 4:56 PM
> >> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>; Rafael J.
> >> Wysocki <rjw@rjwysocki.net>
> >> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device statistics
> >> update for configuration operation
> >>
> >> On 27/09/2021 15:29, Vadim Pasternak wrote:
> >>>
> >>>
> >>>> -----Original Message-----
> >>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>>> Sent: Monday, September 27, 2021 3:32 PM
> >>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >>>> Cc: linux-pm@vger.kernel.org; Ido Schimmel <idosch@nvidia.com>;
> >> Rafael J.
> >>>> Wysocki <rjw@rjwysocki.net>
> >>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
> >>>> statistics update for configuration operation
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 27/09/2021 13:22, Vadim Pasternak wrote:
> >>>>> Hi Daniel,
> >>>>>
> >>>>> Thank you for quick reply.
> >>>>>
> >>>>>> -----Original Message-----
> >>>>>> From: Daniel Lezcano <daniel.lezcano@linaro.org>
> >>>>>> Sent: Monday, September 27, 2021 1:42 PM
> >>>>>> To: Vadim Pasternak <vadimp@nvidia.com>; rui.zhang@intel.com
> >>>>>> Cc: =idosch@nvidia.com; linux-pm@vger.kernel.org
> >>>>>> Subject: Re: [PATCH thermal 1/1] thermal/core: Skip cooling device
> >>>>>> statistics update for configuration operation
> >>>>>>
> >>>>>>
> >>>>>> Hi Vadim,
> >>>>>>
> >>>>>>
> >>>>>> On 27/09/2021 10:24, Vadim Pasternak wrote:
> >>>>>>> The thermal subsystem maintains a transition table between states
> >>>>>>> that is allocated according to the maximum state supported by the
> >>>>>>> cooling device.
> >>>>>>>
> >>>>>>> When the table needs to be updated, the thermal subsystem does
> >> not
> >>>>>>> validate that the new state does not exceed the maximum state,
> >>>>>>> leading to out-of-bounds memory accesses [1].
> >>>>>>
> >>>>>> Actually, thermal_cooling_device_stats_update() is called if the
> >>>>>> set_cur_state is successful.
> >>>>>>
> >>>>>> With a state greater than the max state, the set_cur_state should
> >>>>>> fail and
> >>>>>> thermal_cooling_device_stats_update() is not called.
> >>>>>>
> >>>>>> Perhaps the problem is in mlxsw_thermal_set_cur_state() ?
> >>>>>
> >>>>> "mlxsw" thermal drivers has additional use of 'sysfs' 'cur_state'
> >>>>> for configuration purpose to limit minimum fan speed.
> >>>>> Fan speed minimum is enforced by setting 'cur_state' with value
> >>>>> exceeding actual fan speed maximum.
> >>>>
> >>>> Yes, and that is the problem because the driver is doing weird things
> >>>> with the cooling device state resulting in an abuse of the sysfs API
> >>>> and conflicting with the thermal internals.
> >>>>
> >>>>
> >>>>> This feature provides ability to limit fan speed according to some
> >>>>> system wise considerations, like absence of some replaceable units
> >>>>> or high system ambient temperature, or some other factors which
> >>>>> indirectly impacts system airflow.
> >>>>
> >>>> Is that a static thermal profile depending on the platform set by
> >>>> userspace or something which can be changed dynamically at runtime via
> >> eg. a daemon ?
> >>>
> >>> Yes, this is some profiles/rules, which are system specific and
> >>> according to these rules userspace can limit fan speed. Like, for example:
> >>> - if one of power supplies is removed, system fan should be enforced to
> >> full
> >>>   speed, because it makes a hole in a box, and it has hard impact on airflow.
> >>> - If port side ambient temperature reaches some threshold X1, fan speed
> >> should
> >>>   be limited by Y1%, X2 - Y2%, etcetera.
> >>> - if temperature fault is detected for any optical transceivers - some limit is
> >>>   required.
> >>
> >> I see, thanks for the information.
> >>
> >>>>> For example, if cooling devices operates at cooling levels from 1 to
> >>>>> 10
> >>>>> (1 for 10% fan speed, 10 for 100% fan speed), cooling device minimal
> >>>>> speed can be limited by setting 'cur_state' attribute through 'sysfs'
> >>>>> to the values from 'max_state' + 1 to 'max_state * 2' (from 11 to 20).
> >>>>> Following this example if value is set to 14 (40%) cooling levels
> >>>>> vector will be set to 4, 4, 4, 4, 4, 5, 6, 7, 8, 9, 10 for setting
> >>>>> device speed cooling states respectively in 40, 40, 40, 40, 40, 50,
> >>>>> 60. 70, 80, 90,
> >>>>> 100 percent. And it limits cooling device to operate only at 40%
> >>>>> speed and above.
> >>>>>
> >>>>> Maybe it would be worth adding earlier some dedicated 'cur_state_limit'
> >>>>> attribute for this feature, but it was not done.
> >>>>>
> >>>>> We have another driver required this feature and one new we are
> >>>>> developing now, which require fan minim speed limit as well.
> >>>>
> >>>> The use case is valid but I think the approach is wrong. Probably the
> >>>> simplest thing to do is to set a low trip point with a minimal fan speed.
> >>>
> >>> For "trip_point_0_temp" there is the below definition:
> >>>     {       /* In range - 0-40% PWM */
> >>>             .type           = THERMAL_TRIP_ACTIVE,
> >>>             .temp           = MLXSW_THERMAL_ASIC_TEMP_NORM,
> >>>             .hyst           = MLXSW_THERMAL_HYSTERESIS_TEMP,
> >>>             .min_state      = 0,
> >>>             .max_state      = (4 * MLXSW_THERMAL_MAX_STATE) / 10,
> >> (40%)
> >>>     },
> >>> For "trip_point_1_temp":
> >>>     {
> >>>             /* In range - 40-100% PWM */
> >>>             .type           = THERMAL_TRIP_ACTIVE,
> >>>             .temp           = MLXSW_THERMAL_ASIC_TEMP_HIGH,
> >>>             .hyst           = MLXSW_THERMAL_HYSTERESIS_TEMP,
> >>>             .min_state      = (4 * MLXSW_THERMAL_MAX_STATE) / 10,
> >> (100%)
> >>>             .max_state      = MLXSW_THERMAL_MAX_STATE,
> >>>     },
> >>>
> >>> To limit cooling device by f.e 70%, I should change dynamically 'min_state'
> >> and 'max_state'
> >>> for both trips to (70%, 70%) and (70%, 100%). I am not sure I can do it?
> >>>
> >>> And we have many customers, using this user space interface, it would
> >>> be not so good to change it.
> >>
> >> The issue is the driver is wrong here. The change you referred in the past
> >> should have not been merged. I note there is no thermal maintainer blessing
> >> in the tags.
> >>
> >>> I understand it is possible to handle this issue inside driver's
> >>> set_cur_stat() callback by returning positive value for configuration
> >> request.
> >>> But maybe this feature could you useful for other developers and it
> >>> could be some common interface to support it?
> >>
> >> I suggest to revert a421ce088ac8 and switch to hwmon to set the fan speed.
> >> At the first glance, it is already supported by the mlx driver, no ?
> >>
> >
> > Yes, hwmon is supported.
> > But setting fan trough hwmon means we'll have two owners controlling same device.
>
> Whatever the solution, there are several components acting on the device
> at the same time (userspace and in-kernel) :/
>
>
> > If speed is set through hwmon for example to 100%, thermal could decide to decrease
> > it, since it does not know what the reason was for setting full speed.
> > And it could make a completion between hwmon and thermal for cooling control.
> >
> > Maybe adding new set_min_state()/get_min_state() callback could work?
> > In such case user space can limit speed through "min_state" attrbbute.
> >
> > Maybe some other approach, but within thermal subsystem?
>
> I'm very reluctant to let the userspace to deal with the cooling device
> because that is prone to abuse. And this driver is another example :(
>
> Actually, the thermal framework should be dedicated to protect Tj,
> nothing else.

Do I understand correctly that you mean the fan should be handled by
either hwmon or thermal, but not both at the same time?

I totally agree.

If user space is allowed to control the fan via hwmon, it is
responsible for setting its speed as needed.

> However, as the hwmon is already supported, what is possible is to set
> the min state in the driver when it is set via hwmon and use this min
> state from the cooling device to prevent going below in set_cur_state.
>
> So instead of dealing with out-of-boundaries values with the state, just
> ignore requests which are below the min_state set by the hwmon in
> set_cur_state.
>
> There is no consistent mapping between what is a cooling device state
> and hwmon, so it is hard to do generic code ATM.
>
> Dealing with the state in the driver, makes the code self-encapsulate
> and consistent.
>
> Please revert commit a421ce088ac8 and check the code in mlxreg-fan.c
> which sounds the exact copy of what provided a421ce088ac8.
>
>
> >> --
> >> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
> >>
> >> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> >> <http://twitter.com/#!/linaroorg> Twitter | <http://www.linaro.org/linaro-
> >> blog/> Blog
>
>
> --
> <http://www.linaro.org/> Linaro.org │ Open source software for ARM SoCs
>
> Follow Linaro:  <http://www.facebook.com/pages/Linaro> Facebook |
> <http://twitter.com/#!/linaroorg> Twitter |
> <http://www.linaro.org/linaro-blog/> Blog

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-09-28 11:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-27  8:24 [PATCH thermal 1/1] thermal/core: Skip cooling device statistics update for configuration operation Vadim Pasternak
2021-09-27 10:41 ` Daniel Lezcano
2021-09-27 11:22   ` Vadim Pasternak
2021-09-27 12:31     ` Daniel Lezcano
2021-09-27 13:29       ` Vadim Pasternak
2021-09-27 13:55         ` Daniel Lezcano
2021-09-27 17:52           ` Vadim Pasternak
2021-09-27 21:12             ` Daniel Lezcano
2021-09-28 11:35               ` Rafael J. Wysocki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).