* [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements
@ 2021-06-18 21:54 Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 1/4] hwmon: (lm90) Don't override interrupt trigger type Dmitry Osipenko
                   ` (4 more replies)
  0 siblings, 5 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2021-06-18 21:54 UTC (permalink / raw)
  To: Jean Delvare, Guenter Roeck; +Cc: linux-kernel, linux-hwmon, linux-tegra

Hi,

This series makes the LM90 interrupt usable on NVIDIA Tegra devices. It
also switches the LM90 driver to use hwmon_notify_event().

Changelog:

v3: - No code changes. Added changelog.

v2: - Dropped "hwmon: (lm90) Use edge-triggered interrupt" patch
      and replaced it with "hwmon: (lm90) Don't override interrupt
      trigger type", as was discussed during review of v1.

    - Added these new patches:

        hwmon: (lm90) Use hwmon_notify_event()
        hwmon: (lm90) Unmask hardware interrupt
        hwmon: (lm90) Disable interrupt on suspend

Dmitry Osipenko (4):
  hwmon: (lm90) Don't override interrupt trigger type
  hwmon: (lm90) Use hwmon_notify_event()
  hwmon: (lm90) Unmask hardware interrupt
  hwmon: (lm90) Disable interrupt on suspend

 drivers/hwmon/lm90.c | 79 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 66 insertions(+), 13 deletions(-)

-- 
2.30.2



* [PATCH v3 1/4] hwmon: (lm90) Don't override interrupt trigger type
  2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
@ 2021-06-18 21:54 ` Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event() Dmitry Osipenko
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2021-06-18 21:54 UTC (permalink / raw)
  To: Jean Delvare, Guenter Roeck; +Cc: linux-kernel, linux-hwmon, linux-tegra

The lm90 driver sets the interrupt trigger type to level-low. This type is
not suitable for sensors like the NCT1008 that don't deassert the interrupt
line until the temperature is back to normal, resulting in an interrupt
storm. The appropriate trigger type should come from the OF device
description, but it is currently overridden by the driver's trigger type.
Don't specify the trigger type in the driver code, letting the interrupt
core use the device-specific trigger type.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/hwmon/lm90.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
index ebbfd5f352c0..2e057fad05b4 100644
--- a/drivers/hwmon/lm90.c
+++ b/drivers/hwmon/lm90.c
@@ -1908,8 +1908,7 @@ static int lm90_probe(struct i2c_client *client)
 		dev_dbg(dev, "IRQ: %d\n", client->irq);
 		err = devm_request_threaded_irq(dev, client->irq,
 						NULL, lm90_irq_thread,
-						IRQF_TRIGGER_LOW | IRQF_ONESHOT,
-						"lm90", client);
+						IRQF_ONESHOT, "lm90", client);
 		if (err < 0) {
 			dev_err(dev, "cannot request IRQ %d\n", client->irq);
 			return err;
-- 
2.30.2



* [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 1/4] hwmon: (lm90) Don't override interrupt trigger type Dmitry Osipenko
@ 2021-06-18 21:54 ` Dmitry Osipenko
  2022-02-21 12:01   ` Jon Hunter
  2021-06-18 21:54 ` [PATCH v3 3/4] hwmon: (lm90) Unmask hardware interrupt Dmitry Osipenko
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 23+ messages in thread
From: Dmitry Osipenko @ 2021-06-18 21:54 UTC (permalink / raw)
  To: Jean Delvare, Guenter Roeck; +Cc: linux-kernel, linux-hwmon, linux-tegra

Use hwmon_notify_event() to notify userspace and thermal core about
temperature changes.

Suggested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/hwmon/lm90.c | 44 +++++++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 11 deletions(-)

diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
index 2e057fad05b4..e7b678a40b39 100644
--- a/drivers/hwmon/lm90.c
+++ b/drivers/hwmon/lm90.c
@@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
 
 struct lm90_data {
 	struct i2c_client *client;
+	struct device *hwmon_dev;
 	u32 channel_config[4];
 	struct hwmon_channel_info temp_info;
 	const struct hwmon_channel_info *info[3];
@@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client *client, u16 *status)
 
 	if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH | LM90_STATUS_LTHRM)) ||
 	    (st2 & MAX6696_STATUS2_LOT2))
-		dev_warn(&client->dev,
-			 "temp%d out of range, please check!\n", 1);
+		dev_dbg(&client->dev,
+			"temp%d out of range, please check!\n", 1);
 	if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH | LM90_STATUS_RTHRM)) ||
 	    (st2 & MAX6696_STATUS2_ROT2))
-		dev_warn(&client->dev,
-			 "temp%d out of range, please check!\n", 2);
+		dev_dbg(&client->dev,
+			"temp%d out of range, please check!\n", 2);
 	if (st & LM90_STATUS_ROPEN)
-		dev_warn(&client->dev,
-			 "temp%d diode open, please check!\n", 2);
+		dev_dbg(&client->dev,
+			"temp%d diode open, please check!\n", 2);
 	if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
 		   MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
-		dev_warn(&client->dev,
-			 "temp%d out of range, please check!\n", 3);
+		dev_dbg(&client->dev,
+			"temp%d out of range, please check!\n", 3);
 	if (st2 & MAX6696_STATUS2_R2OPEN)
-		dev_warn(&client->dev,
-			 "temp%d diode open, please check!\n", 3);
+		dev_dbg(&client->dev,
+			"temp%d diode open, please check!\n", 3);
+
+	if (st & LM90_STATUS_LLOW)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_min, 0);
+	if (st & LM90_STATUS_RLOW)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_min, 1);
+	if (st2 & MAX6696_STATUS2_R2LOW)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_min, 2);
+	if (st & LM90_STATUS_LHIGH)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_max, 0);
+	if (st & LM90_STATUS_RHIGH)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_max, 1);
+	if (st2 & MAX6696_STATUS2_R2HIGH)
+		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
+				   hwmon_temp_max, 2);
 
 	return true;
 }
@@ -1904,6 +1924,8 @@ static int lm90_probe(struct i2c_client *client)
 	if (IS_ERR(hwmon_dev))
 		return PTR_ERR(hwmon_dev);
 
+	data->hwmon_dev = hwmon_dev;
+
 	if (client->irq) {
 		dev_dbg(dev, "IRQ: %d\n", client->irq);
 		err = devm_request_threaded_irq(dev, client->irq,
@@ -1940,7 +1962,7 @@ static void lm90_alert(struct i2c_client *client, enum i2c_alert_protocol type,
 			lm90_update_confreg(data, data->config | 0x80);
 		}
 	} else {
-		dev_info(&client->dev, "Everything OK\n");
+		dev_dbg(&client->dev, "Everything OK\n");
 	}
 }
 
-- 
2.30.2
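
As a rough userspace illustration of the mapping this patch adds to
lm90_is_tripped(), the sketch below turns a status byte into a list of
(attribute, channel) events. The bit positions, the sketch_* names, and the
two-channel table are assumptions made for the example, not definitions
copied from lm90.c; the kernel code calls hwmon_notify_event() directly
instead of filling an array.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed bit layout, for illustration only; check lm90.c for the real one. */
#define ST_RLOW		(1u << 3)	/* remote channel below its minimum */
#define ST_RHIGH	(1u << 4)	/* remote channel above its maximum */
#define ST_LLOW		(1u << 5)	/* local channel below its minimum */
#define ST_LHIGH	(1u << 6)	/* local channel above its maximum */

/* Hypothetical stand-ins for hwmon_temp_min / hwmon_temp_max. */
enum sketch_attr { SKETCH_TEMP_MIN, SKETCH_TEMP_MAX };

struct sketch_event {
	enum sketch_attr attr;
	int channel;		/* 0 = local sensor, 1 = first remote sensor */
};

/*
 * Translate a status byte into events, in the same order the patch checks
 * the bits: minimum alarms first, then maximum alarms. Returns the number
 * of events written to out[], never more than cap.
 */
static size_t sketch_collect_events(uint8_t st, struct sketch_event *out,
				    size_t cap)
{
	static const struct {
		uint8_t bit;
		enum sketch_attr attr;
		int channel;
	} map[] = {
		{ ST_LLOW,  SKETCH_TEMP_MIN, 0 },
		{ ST_RLOW,  SKETCH_TEMP_MIN, 1 },
		{ ST_LHIGH, SKETCH_TEMP_MAX, 0 },
		{ ST_RHIGH, SKETCH_TEMP_MAX, 1 },
	};
	size_t n = 0;

	for (size_t i = 0; i < sizeof(map) / sizeof(map[0]); i++) {
		if ((st & map[i].bit) && n < cap) {
			out[n].attr = map[i].attr;
			out[n].channel = map[i].channel;
			n++;
		}
	}

	return n;
}
```

A table keeps the bit-to-event pairing in one place; the patch itself spells
out each check, which is equally valid and arguably easier to review in a
diff.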



* [PATCH v3 3/4] hwmon: (lm90) Unmask hardware interrupt
  2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 1/4] hwmon: (lm90) Don't override interrupt trigger type Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event() Dmitry Osipenko
@ 2021-06-18 21:54 ` Dmitry Osipenko
  2021-06-18 21:54 ` [PATCH v3 4/4] hwmon: (lm90) Disable interrupt on suspend Dmitry Osipenko
  2021-06-19 11:10 ` [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Guenter Roeck
  4 siblings, 0 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2021-06-18 21:54 UTC (permalink / raw)
  To: Jean Delvare, Guenter Roeck; +Cc: linux-kernel, linux-hwmon, linux-tegra

The ALERT interrupt is enabled by default after power-on, but it could
be masked by the bootloader. For example, this is the case on the Acer A500
tablet. Unmask the hardware interrupt if an interrupt is provided.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/hwmon/lm90.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
index e7b678a40b39..658b486d2f5e 100644
--- a/drivers/hwmon/lm90.c
+++ b/drivers/hwmon/lm90.c
@@ -1704,6 +1704,13 @@ static int lm90_init_client(struct i2c_client *client, struct lm90_data *data)
 	if (data->kind == max6696)
 		config &= ~0x08;
 
+	/*
+	 * Interrupt is enabled by default on reset, but it may be disabled
+	 * by bootloader, unmask it.
+	 */
+	if (client->irq)
+		config &= ~0x80;
+
 	config &= 0xBF;	/* run */
 	lm90_update_confreg(data, config);
 
-- 
2.30.2
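
The register arithmetic in this patch can be checked in isolation. The
sketch below is a userspace illustration only: the helper name and the
CFG_* macros are hypothetical, and the meaning of the two bits is taken
from the hunk above (0x80 masks ALERT, and "config &= 0xBF" clears bit
0x40 to put the chip in run mode).

```c
#include <stdbool.h>
#include <stdint.h>

/* Names are hypothetical; the values come from the hunk above. */
#define CFG_ALERT_MASK	0x80u	/* set: ALERT output is masked */
#define CFG_STOP	0x40u	/* set: conversions stopped; cleared by 0xBF */

/*
 * Mirror the config handling in lm90_init_client(): unmask ALERT only
 * when an interrupt line is wired up, and always clear the stop bit.
 */
static uint8_t sketch_init_config(uint8_t config, bool has_irq)
{
	if (has_irq)
		config &= (uint8_t)~CFG_ALERT_MASK;

	config &= (uint8_t)~CFG_STOP;	/* same as "config &= 0xBF" */

	return config;
}
```

With a bootloader-provided value of 0xC0 (ALERT masked, chip stopped), the
helper returns 0x00 when an interrupt is present and 0x80 when it is not,
so boards without an IRQ keep whatever masking the firmware chose.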



* [PATCH v3 4/4] hwmon: (lm90) Disable interrupt on suspend
  2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
                   ` (2 preceding siblings ...)
  2021-06-18 21:54 ` [PATCH v3 3/4] hwmon: (lm90) Unmask hardware interrupt Dmitry Osipenko
@ 2021-06-18 21:54 ` Dmitry Osipenko
  2021-06-19 11:10 ` [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Guenter Roeck
  4 siblings, 0 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2021-06-18 21:54 UTC (permalink / raw)
  To: Jean Delvare, Guenter Roeck; +Cc: linux-kernel, linux-hwmon, linux-tegra

I2C accesses are prohibited and will error out once the I2C controller
is suspended, hence we need to ensure that the interrupt won't fire during
suspend, when it's too late to handle it. Disable the interrupt across
suspend/resume.

Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
---
 drivers/hwmon/lm90.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
index 658b486d2f5e..b53f17511b05 100644
--- a/drivers/hwmon/lm90.c
+++ b/drivers/hwmon/lm90.c
@@ -1973,11 +1973,36 @@ static void lm90_alert(struct i2c_client *client, enum i2c_alert_protocol type,
 	}
 }
 
+static int __maybe_unused lm90_suspend(struct device *dev)
+{
+	struct lm90_data *data = dev_get_drvdata(dev);
+	struct i2c_client *client = data->client;
+
+	if (client->irq)
+		disable_irq(client->irq);
+
+	return 0;
+}
+
+static int __maybe_unused lm90_resume(struct device *dev)
+{
+	struct lm90_data *data = dev_get_drvdata(dev);
+	struct i2c_client *client = data->client;
+
+	if (client->irq)
+		enable_irq(client->irq);
+
+	return 0;
+}
+
+static SIMPLE_DEV_PM_OPS(lm90_pm_ops, lm90_suspend, lm90_resume);
+
 static struct i2c_driver lm90_driver = {
 	.class		= I2C_CLASS_HWMON,
 	.driver = {
 		.name	= "lm90",
 		.of_match_table = of_match_ptr(lm90_of_match),
+		.pm	= &lm90_pm_ops,
 	},
 	.probe_new	= lm90_probe,
 	.alert		= lm90_alert,
-- 
2.30.2



* Re: [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements
  2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
                   ` (3 preceding siblings ...)
  2021-06-18 21:54 ` [PATCH v3 4/4] hwmon: (lm90) Disable interrupt on suspend Dmitry Osipenko
@ 2021-06-19 11:10 ` Guenter Roeck
  4 siblings, 0 replies; 23+ messages in thread
From: Guenter Roeck @ 2021-06-19 11:10 UTC (permalink / raw)
  To: Dmitry Osipenko; +Cc: Jean Delvare, linux-kernel, linux-hwmon, linux-tegra

On Sat, Jun 19, 2021 at 12:54:51AM +0300, Dmitry Osipenko wrote:
> Hi,
> 
> This series makes the LM90 interrupt usable on NVIDIA Tegra devices. It
> also switches the LM90 driver to use hwmon_notify_event().
> 

Series applied.

Thanks,
Guenter

> Changelog:
> 
> v3: - No code changes. Added changelog.
> 
> v2: - Dropped "hwmon: (lm90) Use edge-triggered interrupt" patch
>       and replaced it with "hwmon: (lm90) Don't override interrupt
>       trigger type", as was discussed during review of v1.
> 
>     - Added these new patches:
> 
>         hwmon: (lm90) Use hwmon_notify_event()
>         hwmon: (lm90) Unmask hardware interrupt
>         hwmon: (lm90) Disable interrupt on suspend
> 
> Dmitry Osipenko (4):
>   hwmon: (lm90) Don't override interrupt trigger type
>   hwmon: (lm90) Use hwmon_notify_event()
>   hwmon: (lm90) Unmask hardware interrupt
>   hwmon: (lm90) Disable interrupt on suspend
> 
>  drivers/hwmon/lm90.c | 79 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 66 insertions(+), 13 deletions(-)
> 
> -- 
> 2.30.2
> 


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2021-06-18 21:54 ` [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event() Dmitry Osipenko
@ 2022-02-21 12:01   ` Jon Hunter
  2022-02-21 12:36     ` Dmitry Osipenko
  2022-02-21 15:43     ` Guenter Roeck
  0 siblings, 2 replies; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 12:01 UTC (permalink / raw)
  To: Dmitry Osipenko, Jean Delvare, Guenter Roeck
  Cc: linux-kernel, linux-hwmon, linux-tegra

Hi Dmitry,

On 18/06/2021 22:54, Dmitry Osipenko wrote:
> Use hwmon_notify_event() to notify userspace and thermal core about
> temperature changes.
> 
> Suggested-by: Guenter Roeck <linux@roeck-us.net>
> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
> ---
>   drivers/hwmon/lm90.c | 44 +++++++++++++++++++++++++++++++++-----------
>   1 file changed, 33 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
> index 2e057fad05b4..e7b678a40b39 100644
> --- a/drivers/hwmon/lm90.c
> +++ b/drivers/hwmon/lm90.c
> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>   
>   struct lm90_data {
>   	struct i2c_client *client;
> +	struct device *hwmon_dev;
>   	u32 channel_config[4];
>   	struct hwmon_channel_info temp_info;
>   	const struct hwmon_channel_info *info[3];
> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client *client, u16 *status)
>   
>   	if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH | LM90_STATUS_LTHRM)) ||
>   	    (st2 & MAX6696_STATUS2_LOT2))
> -		dev_warn(&client->dev,
> -			 "temp%d out of range, please check!\n", 1);
> +		dev_dbg(&client->dev,
> +			"temp%d out of range, please check!\n", 1);
>   	if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH | LM90_STATUS_RTHRM)) ||
>   	    (st2 & MAX6696_STATUS2_ROT2))
> -		dev_warn(&client->dev,
> -			 "temp%d out of range, please check!\n", 2);
> +		dev_dbg(&client->dev,
> +			"temp%d out of range, please check!\n", 2);
>   	if (st & LM90_STATUS_ROPEN)
> -		dev_warn(&client->dev,
> -			 "temp%d diode open, please check!\n", 2);
> +		dev_dbg(&client->dev,
> +			"temp%d diode open, please check!\n", 2);
>   	if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>   		   MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
> -		dev_warn(&client->dev,
> -			 "temp%d out of range, please check!\n", 3);
> +		dev_dbg(&client->dev,
> +			"temp%d out of range, please check!\n", 3);
>   	if (st2 & MAX6696_STATUS2_R2OPEN)
> -		dev_warn(&client->dev,
> -			 "temp%d diode open, please check!\n", 3);
> +		dev_dbg(&client->dev,
> +			"temp%d diode open, please check!\n", 3);
> +
> +	if (st & LM90_STATUS_LLOW)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_min, 0);
> +	if (st & LM90_STATUS_RLOW)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_min, 1);
> +	if (st2 & MAX6696_STATUS2_R2LOW)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_min, 2);
> +	if (st & LM90_STATUS_LHIGH)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_max, 0);
> +	if (st & LM90_STATUS_RHIGH)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_max, 1);
> +	if (st2 & MAX6696_STATUS2_R2HIGH)
> +		hwmon_notify_event(data->hwmon_dev, hwmon_temp,
> +				   hwmon_temp_max, 2);


We observed a random NULL pointer dereference crash somewhere in the
thermal core (the crash log below is not very helpful) when calling
mutex_lock(). It looks like we get an interrupt when this crash
happens.

Looking at the lm90 driver, per the above, I now see we are calling
hwmon_notify_event() from the lm90 interrupt handler. Looking at
hwmon_notify_event() I see that ...

hwmon_notify_event()
   --> hwmon_thermal_notify()
     --> thermal_zone_device_update()
       --> update_temperature()
         --> mutex_lock()

So although I don't completely understand the crash, it does seem
that we should not be calling hwmon_notify_event() from the
interrupt handler.

BTW I have not reproduced this myself yet, so I have just been
reviewing the code to try and understand this.

Jon

[ 7465.595066] Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
[ 7465.596619] Mem abort info:
[ 7465.597854]   ESR = 0x96000021
[ 7465.599097]   EC = 0x25: DABT (current EL), IL = 32 bits
[ 7465.600338]   SET = 0, FnV = 0
[ 7465.601526]   EA = 0, S1PTW = 0
[ 7465.602705]   FSC = 0x21: alignment fault
[ 7465.603885] Data abort info:
[ 7465.605017]   ISV = 0, ISS = 0x00000021
[ 7465.606171]   CM = 0, WnR = 0
[ 7465.607301] user pgtable: 64k pages, 48-bit VAs, pgdp=00000001041f1800
[ 7465.608490] [00000000000003cd] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[ 7465.609814] Internal error: Oops: 96000021 [#1] PREEMPT SMP
[ 7465.610991] Modules linked in: bridge stp llc snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_mixer snd_soc_tegra210_mvc snd_soc_tegra210_i2s snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_sfc snd_soc_tegra210_amx snd_soc_tegra210_ahub tegra210_adma snd_soc_rt5659 snd_soc_rl6231 pwm_tegra tegra_aconnect snd_hda_codec_hdmi rfkill snd_hda_tegra snd_hda_codec at24 phy_tegra194_p2u snd_hda_core lm90 snd_soc_tegra_audio_graph_card tegra_bpmp_thermal snd_soc_audio_graph_card snd_soc_simple_card_utils pwm_fan crct10dif_ce pcie_tegra194 cec drm_kms_helper drm ip_tables x_tables ipv6
[ 7465.632232] CPU: 2 PID: 433 Comm: irq/140-lm90 Tainted: G           O      5.16.0-tegra-g9d109504d83a #1
[ 7465.636285] Hardware name: Unknown NVIDIA Jetson AGX Xavier Developer Kit/NVIDIA Jetson AGX Xavier Developer Kit, BIOS v1.1.2-901d3c52 02/07/2022
[ 7465.650457] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 7465.656210] pc : mutex_lock+0x18/0x60
[ 7465.660134] lr : thermal_zone_device_update+0x40/0x2e0
[ 7465.665117] sp : ffff800014c4fc60
[ 7465.668781] x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
[ 7465.675882] x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
[ 7465.683485] x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
[ 7465.690816] x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
[ 7465.697982] x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
[ 7465.705320] x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
[ 7465.712493] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 7465.719580] x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
[ 7465.727099] x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
[ 7465.734348] x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
[ 7465.741347] Call trace:
[ 7465.744207]  mutex_lock+0x18/0x60
[ 7465.747427]  hwmon_notify_event+0xfc/0x110
[ 7465.751358]  0xffffdde1cb7a0a90
[ 7465.754574]  0xffffdde1cb7a0b7c
[ 7465.757705]  irq_thread_fn+0x2c/0xa0
[ 7465.760937]  irq_thread+0x134/0x240
[ 7465.764850]  kthread+0x178/0x190
[ 7465.768083]  ret_from_fork+0x10/0x20
[ 7465.771748] Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
[ 7465.777865] ---[ end trace f0b3723991411538 ]---



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:01   ` Jon Hunter
@ 2022-02-21 12:36     ` Dmitry Osipenko
  2022-02-21 12:56       ` Jon Hunter
  2022-02-21 15:43     ` Guenter Roeck
  1 sibling, 1 reply; 23+ messages in thread
From: Dmitry Osipenko @ 2022-02-21 12:36 UTC (permalink / raw)
  To: Jon Hunter, Jean Delvare, Guenter Roeck, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 21.02.2022 15:01, Jon Hunter wrote:
> Hi Dmitry,
> 
> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>> Use hwmon_notify_event() to notify userspace and thermal core about
>> temperature changes.
>>
>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>> ---
>>   drivers/hwmon/lm90.c | 44 +++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>> index 2e057fad05b4..e7b678a40b39 100644
>> --- a/drivers/hwmon/lm90.c
>> +++ b/drivers/hwmon/lm90.c
>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>     struct lm90_data {
>>       struct i2c_client *client;
>> +    struct device *hwmon_dev;
>>       u32 channel_config[4];
>>       struct hwmon_channel_info temp_info;
>>       const struct hwmon_channel_info *info[3];
>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>> *client, u16 *status)
>>         if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>> LM90_STATUS_LTHRM)) ||
>>           (st2 & MAX6696_STATUS2_LOT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 1);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 1);
>>       if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>> LM90_STATUS_RTHRM)) ||
>>           (st2 & MAX6696_STATUS2_ROT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 2);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 2);
>>       if (st & LM90_STATUS_ROPEN)
>> -        dev_warn(&client->dev,
>> -             "temp%d diode open, please check!\n", 2);
>> +        dev_dbg(&client->dev,
>> +            "temp%d diode open, please check!\n", 2);
>>       if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>              MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 3);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 3);
>>       if (st2 & MAX6696_STATUS2_R2OPEN)
>> -        dev_warn(&client->dev,
>> -             "temp%d diode open, please check!\n", 3);
>> +        dev_dbg(&client->dev,
>> +            "temp%d diode open, please check!\n", 3);
>> +
>> +    if (st & LM90_STATUS_LLOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 0);
>> +    if (st & LM90_STATUS_RLOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 1);
>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 2);
>> +    if (st & LM90_STATUS_LHIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 0);
>> +    if (st & LM90_STATUS_RHIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 1);
>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 2);
> 
> 
> We observed a random NULL pointer dereference crash somewhere in the
> thermal core (crash log below is not very helpful) when calling
> mutex_lock(). It looks like we get an interrupt when this crash
> happens.
> 
> Looking at the lm90 driver, per the above, I now see we are calling
> hwmon_notify_event() from the lm90 interrupt handler. Looking at
> hwmon_notify_event() I see that ...
> 
> hwmon_notify_event()
>   --> hwmon_thermal_notify()
>     --> thermal_zone_device_update()
>       --> update_temperature()
>         --> mutex_lock()
> 
> So although I don't completely understand the crash, it does seem
> that we should not be calling hwmon_notify_event() from the
> interrupt handler.
> 
> BTW I have not reproduced this myself yet, so I have just been
> reviewing the code to try and understand this.

Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
spot any problem in the code. IIRC, it was a NULL dereference of another
pointer within that code.

We tried adding a couple of debug printks, and the problem disappeared as
soon as we tried to look at it, which is very suspicious.

> 
> [ 7465.595066] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000003cd
> [ 7465.596619] Mem abort info:
> [ 7465.597854]   ESR = 0x96000021
> [ 7465.599097]   EC = 0x25: DABT (current EL), IL = 32 bits
> [ 7465.600338]   SET = 0, FnV = 0
> [ 7465.601526]   EA = 0, S1PTW = 0
> [ 7465.602705]   FSC = 0x21: alignment fault
> [ 7465.603885] Data abort info:
> [ 7465.605017]   ISV = 0, ISS = 0x00000021
> [ 7465.606171]   CM = 0, WnR = 0
> [ 7465.607301] user pgtable: 64k pages, 48-bit VAs, pgdp=00000001041f1800
> [ 7465.608490] [00000000000003cd] pgd=0000000000000000,
> p4d=0000000000000000, pud=0000000000000000
> [ 7465.609814] Internal error: Oops: 96000021 [#1] PREEMPT SMP
> [ 7465.610991] Modules linked in: bridge stp llc snd_soc_tegra210_admaif
> snd_soc_tegra_pcm snd_soc_tegra210_mixer snd_soc_tegra210_mvc
> snd_soc_tegra210_i2s snd_soc_tegra210_dmic
> snd_soc_tegra210_adx snd_soc_tegra210_sfc snd_soc_tegra210_amx
> snd_soc_tegra210_ahub tegra210_adma snd_soc_rt5659 snd_soc_rl6231
> pwm_tegra tegra_aconnect snd_hda_codec_hdmi rfkill snd_hda_tegra
> snd_hda_codec at24 phy_tegra194_p2u snd_hda_core lm90
> snd_soc_tegra_audio_graph_card tegra_bpmp_thermal
> snd_soc_audio_graph_card snd_soc_simple_card_utils pwm_fan crct10dif_ce
> pcie_tegra194 cec drm_kms_helper drm ip_tables x_tables ipv6
> [ 7465.632232] CPU: 2 PID: 433 Comm: irq/140-lm90 Tainted: G          
> O      5.16.0-tegra-g9d109504d83a #1
> [ 7465.636285] Hardware name: Unknown NVIDIA Jetson AGX Xavier Developer
> Kit/NVIDIA Jetson AGX Xavier Developer Kit, BIOS v1.1.2-901d3c52 02/07/2022
> [ 7465.650457] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [ 7465.656210] pc : mutex_lock+0x18/0x60
> [ 7465.660134] lr : thermal_zone_device_update+0x40/0x2e0
> [ 7465.665117] sp : ffff800014c4fc60
> [ 7465.668781] x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27:
> ffffdde218426790
> [ 7465.675882] x26: ffff365ee3f6e000 x25: 0000000000000000 x24:
> ffff365ee3f6e000
> [ 7465.683485] x23: ffffdde218426870 x22: ffff365ee3f6e000 x21:
> 00000000000003cd
> [ 7465.690816] x20: ffff365ee8bf3308 x19: ffffffffffffffed x18:
> 0000000000000000
> [ 7465.697982] x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15:
> 0000000000000040
> [ 7465.705320] x14: ffffdde21a4889a0 x13: 0000000000000228 x12:
> 0000000000000000
> [ 7465.712493] x11: 0000000000000000 x10: 0000000000000000 x9 :
> 0000000000000000
> [ 7465.719580] x8 : 0000000001120000 x7 : 0000000000000001 x6 :
> 0000000000000000
> [ 7465.727099] x5 : 0068000878e20f07 x4 : 0000000000000000 x3 :
> 00000000000003cd
> [ 7465.734348] x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 :
> 00000000000003cd
> [ 7465.741347] Call trace:
> [ 7465.744207]  mutex_lock+0x18/0x60
> [ 7465.747427]  hwmon_notify_event+0xfc/0x110
> [ 7465.751358]  0xffffdde1cb7a0a90
> [ 7465.754574]  0xffffdde1cb7a0b7c
> [ 7465.757705]  irq_thread_fn+0x2c/0xa0
> [ 7465.760937]  irq_thread+0x134/0x240
> [ 7465.764850]  kthread+0x178/0x190
> [ 7465.768083]  ret_from_fork+0x10/0x20
> [ 7465.771748] Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
> [ 7465.777865] ---[ end trace f0b3723991411538 ]---
> 



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:36     ` Dmitry Osipenko
@ 2022-02-21 12:56       ` Jon Hunter
  2022-02-21 12:59         ` Dmitry Osipenko
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 12:56 UTC (permalink / raw)
  To: Dmitry Osipenko, Jean Delvare, Guenter Roeck, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra


On 21/02/2022 12:36, Dmitry Osipenko wrote:
> On 21.02.2022 15:01, Jon Hunter wrote:
>> Hi Dmitry,
>>
>> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>>> Use hwmon_notify_event() to notify userspace and thermal core about
>>> temperature changes.
>>>
>>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>> ---
>>>    drivers/hwmon/lm90.c | 44 +++++++++++++++++++++++++++++++++-----------
>>>    1 file changed, 33 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>>> index 2e057fad05b4..e7b678a40b39 100644
>>> --- a/drivers/hwmon/lm90.c
>>> +++ b/drivers/hwmon/lm90.c
>>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>>      struct lm90_data {
>>>        struct i2c_client *client;
>>> +    struct device *hwmon_dev;
>>>        u32 channel_config[4];
>>>        struct hwmon_channel_info temp_info;
>>>        const struct hwmon_channel_info *info[3];
>>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>>> *client, u16 *status)
>>>          if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>>> LM90_STATUS_LTHRM)) ||
>>>            (st2 & MAX6696_STATUS2_LOT2))
>>> -        dev_warn(&client->dev,
>>> -             "temp%d out of range, please check!\n", 1);
>>> +        dev_dbg(&client->dev,
>>> +            "temp%d out of range, please check!\n", 1);
>>>        if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>>> LM90_STATUS_RTHRM)) ||
>>>            (st2 & MAX6696_STATUS2_ROT2))
>>> -        dev_warn(&client->dev,
>>> -             "temp%d out of range, please check!\n", 2);
>>> +        dev_dbg(&client->dev,
>>> +            "temp%d out of range, please check!\n", 2);
>>>        if (st & LM90_STATUS_ROPEN)
>>> -        dev_warn(&client->dev,
>>> -             "temp%d diode open, please check!\n", 2);
>>> +        dev_dbg(&client->dev,
>>> +            "temp%d diode open, please check!\n", 2);
>>>        if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>>               MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>>> -        dev_warn(&client->dev,
>>> -             "temp%d out of range, please check!\n", 3);
>>> +        dev_dbg(&client->dev,
>>> +            "temp%d out of range, please check!\n", 3);
>>>        if (st2 & MAX6696_STATUS2_R2OPEN)
>>> -        dev_warn(&client->dev,
>>> -             "temp%d diode open, please check!\n", 3);
>>> +        dev_dbg(&client->dev,
>>> +            "temp%d diode open, please check!\n", 3);
>>> +
>>> +    if (st & LM90_STATUS_LLOW)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_min, 0);
>>> +    if (st & LM90_STATUS_RLOW)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_min, 1);
>>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_min, 2);
>>> +    if (st & LM90_STATUS_LHIGH)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_max, 0);
>>> +    if (st & LM90_STATUS_RHIGH)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_max, 1);
>>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>> +                   hwmon_temp_max, 2);
>>
>>
>> We observed a random NULL pointer dereference crash somewhere in the
>> thermal core (crash log below is not very helpful) when calling
>> mutex_lock(). It looks like we get an interrupt when this crash
>> happens.
>>
>> Looking at the lm90 driver, per the above, I now see we are calling
>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>> hwmon_notify_event() I see that ...
>>
>> hwmon_notify_event()
>>    --> hwmon_thermal_notify()
>>      --> thermal_zone_device_update()
>>        --> update_temperature()
>>          --> mutex_lock()
>>
>> So although I don't completely understand the crash, it does seem
>> that we should not be calling hwmon_notify_event() from the
>> interrupt handler.
>>
>> BTW I have not reproduced this myself yet, so I have just been
>> reviewing the code to try and understand this.
> 
> Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
> managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
> spot any problem in the code. IIRC, it was a NULL dereference of another
> pointer within that code.


OK. Looking at the above, I don't think we can call
hwmon_notify_event() from an interrupt handler, because it is going to
try to take a mutex. So we need to fix that.

Jon



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:56       ` Jon Hunter
@ 2022-02-21 12:59         ` Dmitry Osipenko
  2022-02-21 13:50           ` Jon Hunter
  2022-02-21 15:25           ` Guenter Roeck
  0 siblings, 2 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2022-02-21 12:59 UTC (permalink / raw)
  To: Jon Hunter, Jean Delvare, Guenter Roeck, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra

21.02.2022 15:56, Jon Hunter wrote:
> 
> On 21/02/2022 12:36, Dmitry Osipenko wrote:
>> 21.02.2022 15:01, Jon Hunter wrote:
>>> Hi Dmitry,
>>>
>>> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>>>> Use hwmon_notify_event() to notify userspace and thermal core about
>>>> temperature changes.
>>>>
>>>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>>> ---
>>>>    drivers/hwmon/lm90.c | 44
>>>> +++++++++++++++++++++++++++++++++-----------
>>>>    1 file changed, 33 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>>>> index 2e057fad05b4..e7b678a40b39 100644
>>>> --- a/drivers/hwmon/lm90.c
>>>> +++ b/drivers/hwmon/lm90.c
>>>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>>>      struct lm90_data {
>>>>        struct i2c_client *client;
>>>> +    struct device *hwmon_dev;
>>>>        u32 channel_config[4];
>>>>        struct hwmon_channel_info temp_info;
>>>>        const struct hwmon_channel_info *info[3];
>>>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>>>> *client, u16 *status)
>>>>          if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>>>> LM90_STATUS_LTHRM)) ||
>>>>            (st2 & MAX6696_STATUS2_LOT2))
>>>> -        dev_warn(&client->dev,
>>>> -             "temp%d out of range, please check!\n", 1);
>>>> +        dev_dbg(&client->dev,
>>>> +            "temp%d out of range, please check!\n", 1);
>>>>        if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>>>> LM90_STATUS_RTHRM)) ||
>>>>            (st2 & MAX6696_STATUS2_ROT2))
>>>> -        dev_warn(&client->dev,
>>>> -             "temp%d out of range, please check!\n", 2);
>>>> +        dev_dbg(&client->dev,
>>>> +            "temp%d out of range, please check!\n", 2);
>>>>        if (st & LM90_STATUS_ROPEN)
>>>> -        dev_warn(&client->dev,
>>>> -             "temp%d diode open, please check!\n", 2);
>>>> +        dev_dbg(&client->dev,
>>>> +            "temp%d diode open, please check!\n", 2);
>>>>        if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>>>               MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>>>> -        dev_warn(&client->dev,
>>>> -             "temp%d out of range, please check!\n", 3);
>>>> +        dev_dbg(&client->dev,
>>>> +            "temp%d out of range, please check!\n", 3);
>>>>        if (st2 & MAX6696_STATUS2_R2OPEN)
>>>> -        dev_warn(&client->dev,
>>>> -             "temp%d diode open, please check!\n", 3);
>>>> +        dev_dbg(&client->dev,
>>>> +            "temp%d diode open, please check!\n", 3);
>>>> +
>>>> +    if (st & LM90_STATUS_LLOW)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_min, 0);
>>>> +    if (st & LM90_STATUS_RLOW)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_min, 1);
>>>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_min, 2);
>>>> +    if (st & LM90_STATUS_LHIGH)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_max, 0);
>>>> +    if (st & LM90_STATUS_RHIGH)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_max, 1);
>>>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>> +                   hwmon_temp_max, 2);
>>>
>>>
>>> We observed a random NULL pointer dereference crash somewhere in the
>>> thermal core (crash log below is not very helpful) when calling
>>> mutex_lock(). It looks like we get an interrupt when this crash
>>> happens.
>>>
>>> Looking at the lm90 driver, per the above, I now see we are calling
>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>> hwmon_notify_event() I see that ...
>>>
>>> hwmon_notify_event()
>>>    --> hwmon_thermal_notify()
>>>      --> thermal_zone_device_update()
>>>        --> update_temperature()
>>>          --> mutex_lock()
>>>
>>> So although I don't completely understand the crash, it does seem
>>> that we should not be calling hwmon_notify_event() from the
>>> interrupt handler.
>>>
>>> BTW I have not reproduced this myself yet, so I have just been
>>> reviewing the code to try and understand this.
>>
>> Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
>> managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
>> spot any problem in the code. IIRC, it was a NULL dereference of another
>> pointer within that code.
> 
> 
> OK. From looking at the above I don't think we can call
> hwmon_notify_event() from an interrupt handler because this is going to
> try and request a mutex. So we need to fix that.

The interrupt is threaded, so it can take a mutex.


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:59         ` Dmitry Osipenko
@ 2022-02-21 13:50           ` Jon Hunter
  2022-02-21 13:59             ` Dmitry Osipenko
  2022-02-21 15:25           ` Guenter Roeck
  1 sibling, 1 reply; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 13:50 UTC (permalink / raw)
  To: Dmitry Osipenko, Jean Delvare, Guenter Roeck, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra


On 21/02/2022 12:59, Dmitry Osipenko wrote:
> 21.02.2022 15:56, Jon Hunter wrote:
>>
>> On 21/02/2022 12:36, Dmitry Osipenko wrote:
>>> 21.02.2022 15:01, Jon Hunter wrote:
>>>> Hi Dmitry,
>>>>
>>>> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>>>>> Use hwmon_notify_event() to notify userspace and thermal core about
>>>>> temperature changes.
>>>>>
>>>>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>>>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>>>> ---
>>>>>     drivers/hwmon/lm90.c | 44
>>>>> +++++++++++++++++++++++++++++++++-----------
>>>>>     1 file changed, 33 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>>>>> index 2e057fad05b4..e7b678a40b39 100644
>>>>> --- a/drivers/hwmon/lm90.c
>>>>> +++ b/drivers/hwmon/lm90.c
>>>>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>>>>       struct lm90_data {
>>>>>         struct i2c_client *client;
>>>>> +    struct device *hwmon_dev;
>>>>>         u32 channel_config[4];
>>>>>         struct hwmon_channel_info temp_info;
>>>>>         const struct hwmon_channel_info *info[3];
>>>>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>>>>> *client, u16 *status)
>>>>>           if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>>>>> LM90_STATUS_LTHRM)) ||
>>>>>             (st2 & MAX6696_STATUS2_LOT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 1);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 1);
>>>>>         if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>>>>> LM90_STATUS_RTHRM)) ||
>>>>>             (st2 & MAX6696_STATUS2_ROT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 2);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 2);
>>>>>         if (st & LM90_STATUS_ROPEN)
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d diode open, please check!\n", 2);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d diode open, please check!\n", 2);
>>>>>         if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>>>>                MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 3);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 3);
>>>>>         if (st2 & MAX6696_STATUS2_R2OPEN)
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d diode open, please check!\n", 3);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d diode open, please check!\n", 3);
>>>>> +
>>>>> +    if (st & LM90_STATUS_LLOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 0);
>>>>> +    if (st & LM90_STATUS_RLOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 1);
>>>>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 2);
>>>>> +    if (st & LM90_STATUS_LHIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 0);
>>>>> +    if (st & LM90_STATUS_RHIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 1);
>>>>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 2);
>>>>
>>>>
>>>> We observed a random NULL pointer dereference crash somewhere in the
>>>> thermal core (crash log below is not very helpful) when calling
>>>> mutex_lock(). It looks like we get an interrupt when this crash
>>>> happens.
>>>>
>>>> Looking at the lm90 driver, per the above, I now see we are calling
>>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>>> hwmon_notify_event() I see that ...
>>>>
>>>> hwmon_notify_event()
>>>>     --> hwmon_thermal_notify()
>>>>       --> thermal_zone_device_update()
>>>>         --> update_temperature()
>>>>           --> mutex_lock()
>>>>
>>>> So although I don't completely understand the crash, it does seem
>>>> that we should not be calling hwmon_notify_event() from the
>>>> interrupt handler.
>>>>
>>>> BTW I have not reproduced this myself yet, so I have just been
>>>> reviewing the code to try and understand this.
>>>
>>> Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
>>> managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
>>> spot any problem in the code. IIRC, it was a NULL dereference of another
>>> pointer within that code.
>>
>>
>> OK. From looking at the above I don't think we can call
>> hwmon_notify_event() from an interrupt handler because this is going to
>> try and request a mutex. So we need to fix that.
> 
> The interrupt is threaded, so it can take a mutex.


Ah yes, I clearly overlooked that detail.

Good news is that I have been able to reproduce this on Jetson Xavier by ...

$ echo 40000 | sudo tee /sys/devices/platform/bpmp/bpmp\:i2c/i2c-0/0-004c/hwmon/hwmon0/temp1_max
40000
[  105.890995] Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
[  105.900105] Mem abort info:
[  105.903328]   ESR = 0x96000021
[  105.906673]   EC = 0x25: DABT (current EL), IL = 32 bits
[  105.912407]   SET = 0, FnV = 0
[  105.915751]   EA = 0, S1PTW = 0
[  105.919230]   FSC = 0x21: alignment fault
[  105.923698] Data abort info:
[  105.926853]   ISV = 0, ISS = 0x00000021
[  105.931139]   CM = 0, WnR = 0
[  105.934420] user pgtable: 64k pages, 48-bit VAs, pgdp=0000000101f6b600
[  105.941230] [00000000000003cd] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
[  105.950864] Internal error: Oops: 96000021 [#1] PREEMPT SMP
[  105.956608] Modules linked in: btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress rfkill snd_soc_tegra210_mixer snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_mvc snd_soc_tegra210_amx snd_soc_tegra210_sfc snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_i2s snd_soc_tegra210_ahub uvcvideo videobuf2_vmalloc tegra210_adma videobuf2_memops videobuf2_v4l2 cec videobuf2_common drm_kms_helper videodev mc snd_soc_rt5659 snd_soc_rl6231 pwm_tegra tegra_aconnect snd_hda_codec_hdmi lm90 tegra_bpmp_thermal snd_hda_tegra snd_soc_tegra_audio_graph_card snd_hda_codec snd_hda_core phy_tegra194_p2u snd_soc_audio_graph_card at24 snd_soc_simple_card_utils pwm_fan pcie_tegra194 crct10dif_ce drm ip_tables x_tables ipv6
[  106.032497] CPU: 3 PID: 296 Comm: irq/126-lm90 Tainted: G           O      5.16.0-tegra-291805-gf905a41db850 #3
[  106.042869] Hardware name: Unknown NVIDIA Jetson AGX Xavier Developer Kit/NVIDIA Jetson AGX Xavier Developer Kit, BIOS v1.1.2-901d3c52ed23 02/14/2022
[  106.056392] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  106.063903] pc : mutex_lock+0x18/0x60
[  106.067750] lr : thermal_zone_device_update+0x40/0x2e0
[  106.073161] sp : ffff80001494fc60
[  106.076553] x29: ffff80001494fc60 x28: ffff59bb27801c00 x27: ffffa4295d826790
[  106.084052] x26: ffff59bb27801c00 x25: 0000000000000000 x24: ffff59bb27801c00
[  106.091541] x23: ffffa4295d826870 x22: ffff59bb27801c00 x21: 00000000000003cd
[  106.098905] x20: ffff59bb28078f08 x19: ffffffffffffffed x18: 0000000000000000
[  106.106387] x17: ffffa4295d82689c x16: ffffa4292d400b7c x15: 0000000000000040
[  106.113766] x14: ffffa4295f8889a0 x13: 0000000000000228 x12: 0000000000000000
[  106.121294] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[  106.128793] x8 : 0000000000c2f000 x7 : 0000000000000001 x6 : 0000000000000000
[  106.136133] x5 : 006800047a8e0f07 x4 : 0000000000000000 x3 : 00000000000003cd
[  106.143473] x2 : ffff59bb27801c00 x1 : 0000000000000000 x0 : 00000000000003cd
[  106.150813] Call trace:
[  106.153333]  mutex_lock+0x18/0x60
[  106.156804]  hwmon_notify_event+0xfc/0x110
[  106.161164]  0xffffa4292d400a74
[  106.164417]  0xffffa4292d400b7c
[  106.167659]  irq_thread_fn+0x2c/0xa0
[  106.171359]  irq_thread+0x134/0x240
[  106.174971]  kthread+0x178/0x190
[  106.178469]  ret_from_fork+0x10/0x20
[  106.182187] Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
[  106.188550] ---[ end trace 62bf0e0b37a16815 ]---
[  106.193261] Kernel panic - not syncing: Oops: Fatal exception
[  106.199106] SMP: stopping secondary CPUs
[  106.203401] Kernel Offset: 0x24294d740000 from 0xffff800010000000
[  106.209584] PHYS_OFFSET: 0xffffa645e0000000
[  106.214011] CPU features: 0x0,40000843,06400846
[  106.218651] Memory Limit: none
[  106.221822] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
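
For readers who want to cross-check the "Mem abort info" lines in the log above, here is a minimal sketch of how the ESR value 0x96000021 splits into the fields the kernel printed. The helper names (esr_ec() and friends) are made up for this illustration and are not kernel API; the bit positions follow the usual AArch64 ESR_ELx layout for data aborts.

```c
#include <stdint.h>

/*
 * Hypothetical helpers (not kernel functions) that split an AArch64
 * ESR_ELx value into the fields shown under "Mem abort info" and
 * "Data abort info" in the oops.
 */

/* Exception class, bits [31:26]; the log reports 0x25, a data abort at the current EL. */
static inline uint32_t esr_ec(uint32_t esr)
{
	return (esr >> 26) & 0x3f;
}

/* Instruction length, bit 25; 1 means a 32-bit instruction. */
static inline uint32_t esr_il(uint32_t esr)
{
	return (esr >> 25) & 0x1;
}

/* Instruction-specific syndrome, bits [24:0]. */
static inline uint32_t esr_iss(uint32_t esr)
{
	return esr & 0x1ffffff;
}

/* Fault status code, bits [5:0] of the ISS; the log reports 0x21, an alignment fault. */
static inline uint32_t esr_fsc(uint32_t esr)
{
	return esr & 0x3f;
}

/* Write-not-read, bit 6 of the ISS; 0 means the faulting access was a read. */
static inline uint32_t esr_wnr(uint32_t esr)
{
	return (esr >> 6) & 0x1;
}
```

Calling these with 0x96000021 gives EC = 0x25, IL = 1, ISS = 0x21, FSC = 0x21 and WnR = 0, which agrees with the log. Note that the oops banner says "NULL pointer dereference" while the fault status code reports an alignment fault at address 0x3cd.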


I am wondering if this is some sort of race condition in the thermal
shutdown path. I would be interested to know if you see the same.

Cheers
Jon


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 13:50           ` Jon Hunter
@ 2022-02-21 13:59             ` Dmitry Osipenko
  0 siblings, 0 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2022-02-21 13:59 UTC (permalink / raw)
  To: Jon Hunter, Jean Delvare, Guenter Roeck, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra

21.02.2022 16:50, Jon Hunter wrote:
> 
> On 21/02/2022 12:59, Dmitry Osipenko wrote:
>> 21.02.2022 15:56, Jon Hunter wrote:
>>>
>>> On 21/02/2022 12:36, Dmitry Osipenko wrote:
>>>> 21.02.2022 15:01, Jon Hunter wrote:
>>>>> Hi Dmitry,
>>>>>
>>>>> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>>>>>> Use hwmon_notify_event() to notify userspace and thermal core about
>>>>>> temperature changes.
>>>>>>
>>>>>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>>>>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>>>>> ---
>>>>>>     drivers/hwmon/lm90.c | 44
>>>>>> +++++++++++++++++++++++++++++++++-----------
>>>>>>     1 file changed, 33 insertions(+), 11 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>>>>>> index 2e057fad05b4..e7b678a40b39 100644
>>>>>> --- a/drivers/hwmon/lm90.c
>>>>>> +++ b/drivers/hwmon/lm90.c
>>>>>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>>>>>       struct lm90_data {
>>>>>>         struct i2c_client *client;
>>>>>> +    struct device *hwmon_dev;
>>>>>>         u32 channel_config[4];
>>>>>>         struct hwmon_channel_info temp_info;
>>>>>>         const struct hwmon_channel_info *info[3];
>>>>>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>>>>>> *client, u16 *status)
>>>>>>           if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>>>>>> LM90_STATUS_LTHRM)) ||
>>>>>>             (st2 & MAX6696_STATUS2_LOT2))
>>>>>> -        dev_warn(&client->dev,
>>>>>> -             "temp%d out of range, please check!\n", 1);
>>>>>> +        dev_dbg(&client->dev,
>>>>>> +            "temp%d out of range, please check!\n", 1);
>>>>>>         if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>>>>>> LM90_STATUS_RTHRM)) ||
>>>>>>             (st2 & MAX6696_STATUS2_ROT2))
>>>>>> -        dev_warn(&client->dev,
>>>>>> -             "temp%d out of range, please check!\n", 2);
>>>>>> +        dev_dbg(&client->dev,
>>>>>> +            "temp%d out of range, please check!\n", 2);
>>>>>>         if (st & LM90_STATUS_ROPEN)
>>>>>> -        dev_warn(&client->dev,
>>>>>> -             "temp%d diode open, please check!\n", 2);
>>>>>> +        dev_dbg(&client->dev,
>>>>>> +            "temp%d diode open, please check!\n", 2);
>>>>>>         if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>>>>>                MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>>>>>> -        dev_warn(&client->dev,
>>>>>> -             "temp%d out of range, please check!\n", 3);
>>>>>> +        dev_dbg(&client->dev,
>>>>>> +            "temp%d out of range, please check!\n", 3);
>>>>>>         if (st2 & MAX6696_STATUS2_R2OPEN)
>>>>>> -        dev_warn(&client->dev,
>>>>>> -             "temp%d diode open, please check!\n", 3);
>>>>>> +        dev_dbg(&client->dev,
>>>>>> +            "temp%d diode open, please check!\n", 3);
>>>>>> +
>>>>>> +    if (st & LM90_STATUS_LLOW)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_min, 0);
>>>>>> +    if (st & LM90_STATUS_RLOW)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_min, 1);
>>>>>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_min, 2);
>>>>>> +    if (st & LM90_STATUS_LHIGH)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_max, 0);
>>>>>> +    if (st & LM90_STATUS_RHIGH)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_max, 1);
>>>>>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>>> +                   hwmon_temp_max, 2);
>>>>>
>>>>>
>>>>> We observed a random NULL pointer dereference crash somewhere in the
>>>>> thermal core (crash log below is not very helpful) when calling
>>>>> mutex_lock(). It looks like we get an interrupt when this crash
>>>>> happens.
>>>>>
>>>>> Looking at the lm90 driver, per the above, I now see we are calling
>>>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>>>> hwmon_notify_event() I see that ...
>>>>>
>>>>> hwmon_notify_event()
>>>>>     --> hwmon_thermal_notify()
>>>>>       --> thermal_zone_device_update()
>>>>>         --> update_temperature()
>>>>>           --> mutex_lock()
>>>>>
>>>>> So although I don't completely understand the crash, it does seem
>>>>> that we should not be calling hwmon_notify_event() from the
>>>>> interrupt handler.
>>>>>
>>>>> BTW I have not reproduced this myself yet, so I have just been
>>>>> reviewing the code to try and understand this.
>>>>
>>>> Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
>>>> managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
>>>> spot any problem in the code. IIRC, it was a NULL dereference of
>>>> another
>>>> pointer within that code.
>>>
>>>
>>> OK. From looking at the above I don't think we can call
>>> hwmon_notify_event() from an interrupt handler because this is going to
>>> try and request a mutex. So we need to fix that.
>>
>> The interrupt is threaded, so it can take a mutex.
> 
> 
> Ah yes, I clearly overlooked that detail.
> 
> Good news is that I have been able to reproduce this on Jetson Xavier by
> ...
> 
> $ echo 40000 | sudo tee
> /sys/devices/platform/bpmp/bpmp\:i2c/i2c-0/0-004c/hwmon/hwmon0/temp1_max
> 40000
> [  105.890995] Unable to handle kernel NULL pointer dereference at
> virtual address 00000000000003cd
> [  105.900105] Mem abort info:
> [  105.903328]   ESR = 0x96000021
> [  105.906673]   EC = 0x25: DABT (current EL), IL = 32 bits
> [  105.912407]   SET = 0, FnV = 0
> [  105.915751]   EA = 0, S1PTW = 0
> [  105.919230]   FSC = 0x21: alignment fault
> [  105.923698] Data abort info:
> [  105.926853]   ISV = 0, ISS = 0x00000021
> [  105.931139]   CM = 0, WnR = 0
> [  105.934420] user pgtable: 64k pages, 48-bit VAs, pgdp=0000000101f6b600
> [  105.941230] [00000000000003cd] pgd=0000000000000000,
> p4d=0000000000000000, pud=0000000000000000
> [  105.950864] Internal error: Oops: 96000021 [#1] PREEMPT SMP
> [  105.956608] Modules linked in: btrfs blake2b_generic libcrc32c xor
> xor_neon raid6_pq zstd_compress rfkill snd_soc_tegra210_mixer
> snd_soc_tegra210_adx snd_soc_tegra210_dmic snd_soc_tegra210_mvc
> snd_soc_tegra210_amx snd_soc_tegra210_sfc snd_soc_tegra210_admaif
> snd_soc_tegra_pcm snd_soc_tegra210_i2s snd_soc_tegra210_ahub uvcvideo
> videobuf2_vmalloc tegra210_adma videobuf2_memops videobuf2_v4l2 cec
> videobuf2_common drm_kms_helper
> videodev mc snd_soc_rt5659 snd_soc_rl6231 pwm_tegra
> tegra_aconnect snd_hda_codec_hdmi lm90 tegra_bpmp_thermal snd_hda_tegra
> snd_soc_tegra_audio_graph_card snd_hda_codec snd_hda_core
> phy_tegra194_p2u snd_soc_audio_graph_card
> at24 snd_soc_simple_card_utils pwm_fan pcie_tegra194
> crct10dif_ce drm ip_tables x_tables ipv6
> [  106.032497] CPU: 3 PID: 296 Comm: irq/126-lm90 Tainted: G          
> O      5.16.0-tegra-291805-gf905a41db850 #3
> [  106.042869] Hardware name: Unknown NVIDIA Jetson AGX Xavier Developer
> Kit/NVIDIA Jetson AGX Xavier Developer Kit, BIOS v1.1.2-901d3c52ed23
> 02/14/2022
> [  106.056392] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS
> BTYPE=--)
> [  106.063903] pc : mutex_lock+0x18/0x60
> [  106.067750] lr : thermal_zone_device_update+0x40/0x2e0
> [  106.073161] sp : ffff80001494fc60
> [  106.076553] x29: ffff80001494fc60 x28: ffff59bb27801c00 x27:
> ffffa4295d826790
> [  106.084052] x26: ffff59bb27801c00 x25: 0000000000000000 x24:
> ffff59bb27801c00
> [  106.091541] x23: ffffa4295d826870 x22: ffff59bb27801c00 x21:
> 00000000000003cd
> [  106.098905] x20: ffff59bb28078f08 x19: ffffffffffffffed x18:
> 0000000000000000
> [  106.106387] x17: ffffa4295d82689c x16: ffffa4292d400b7c x15:
> 0000000000000040
> [  106.113766] x14: ffffa4295f8889a0 x13: 0000000000000228 x12:
> 0000000000000000
> [  106.121294] x11: 0000000000000000 x10: 0000000000000000 x9 :
> 0000000000000000
> [  106.128793] x8 : 0000000000c2f000 x7 : 0000000000000001 x6 :
> 0000000000000000
> [  106.136133] x5 : 006800047a8e0f07 x4 : 0000000000000000 x3 :
> 00000000000003cd
> [  106.143473] x2 : ffff59bb27801c00 x1 : 0000000000000000 x0 :
> 00000000000003cd
> [  106.150813] Call trace:
> [  106.153333]  mutex_lock+0x18/0x60
> [  106.156804]  hwmon_notify_event+0xfc/0x110
> [  106.161164]  0xffffa4292d400a74
> [  106.164417]  0xffffa4292d400b7c
> [  106.167659]  irq_thread_fn+0x2c/0xa0
> [  106.171359]  irq_thread+0x134/0x240
> [  106.174971]  kthread+0x178/0x190
> [  106.178469]  ret_from_fork+0x10/0x20
> [  106.182187] Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
> [  106.188550] ---[ end trace 62bf0e0b37a16815 ]---
> [  106.193261] Kernel panic - not syncing: Oops: Fatal exception
> [  106.199106] SMP: stopping secondary CPUs
> [  106.203401] Kernel Offset: 0x24294d740000 from 0xffff800010000000
> [  106.209584] PHYS_OFFSET: 0xffffa645e0000000
> [  106.214011] CPU features: 0x0,40000843,06400846
> [  106.218651] Memory Limit: none
> [  106.221822] ---[ end Kernel panic - not syncing: Oops: Fatal
> exception ]---
> 
> 
> I am wondering if this is some sort of race condition in the thermal
> shutdown path. I would be interested to know if you see the same.

Indeed, it feels like a race condition. I tried to reproduce it again on
the A500 using next-20220217, and nothing bad happened.

# sensors
nct1008-i2c-2-4c
Adapter: 7000d000.i2c
temp1:        +31.0°C  (low  = -64.0°C, high = +30.0°C)  ALARM (HIGH)
                       (crit = +120.0°C, hyst = +110.0°C)
temp2:        +37.8°C  (low  = -64.0°C, high = +30.0°C)  ALARM (HIGH)
                       (crit = +115.0°C, hyst = +105.0°C)
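
One detail in the register dump quoted earlier may be worth a second look. This is only an observation and is not established anywhere in this thread. x19 holds 0xffffffffffffffed, which is -19 when read as a signed 64-bit value, and 19 is the numeric value of ENODEV on Linux. The faulting address 0x3cd (x0, x3 and x21) is that value plus 0x3e0. The sketch below just reproduces the arithmetic. The helper names are hypothetical, and the idea that x19 is an error code being used as a pointer base is an assumption.

```c
#include <stdint.h>

/*
 * Hypothetical helpers for reading a register dump; nothing here is
 * kernel API.
 */

/*
 * Reinterpret a raw 64-bit register value as a signed integer.
 * The conversion relies on two's complement, which holds for every
 * compiler that targets Linux.
 */
static inline int64_t reg_as_signed(uint64_t reg)
{
	return (int64_t)reg;
}

/*
 * Distance from a candidate base value to the faulting address,
 * computed modulo 2^64 so that a "negative" base wraps as the CPU would.
 */
static inline uint64_t fault_offset(uint64_t fault_addr, uint64_t base)
{
	return fault_addr - base;
}
```

With the values from the dump, reg_as_signed(0xffffffffffffffed) is -19 and fault_offset(0x3cd, 0xffffffffffffffed) is 0x3e0. If the assumption holds, the mutex being locked would sit at offset 0x3e0 of a structure whose pointer is really an error value, which would point away from a race. The numbers alone do not prove this.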


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:59         ` Dmitry Osipenko
  2022-02-21 13:50           ` Jon Hunter
@ 2022-02-21 15:25           ` Guenter Roeck
  1 sibling, 0 replies; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 15:25 UTC (permalink / raw)
  To: Dmitry Osipenko, Jon Hunter, Jean Delvare, Matt Merhar
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 04:59, Dmitry Osipenko wrote:
> 21.02.2022 15:56, Jon Hunter wrote:
>>
>> On 21/02/2022 12:36, Dmitry Osipenko wrote:
>>> 21.02.2022 15:01, Jon Hunter wrote:
>>>> Hi Dmitry,
>>>>
>>>> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>>>>> Use hwmon_notify_event() to notify userspace and thermal core about
>>>>> temperature changes.
>>>>>
>>>>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>>>>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>>>>> ---
>>>>>     drivers/hwmon/lm90.c | 44
>>>>> +++++++++++++++++++++++++++++++++-----------
>>>>>     1 file changed, 33 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>>>>> index 2e057fad05b4..e7b678a40b39 100644
>>>>> --- a/drivers/hwmon/lm90.c
>>>>> +++ b/drivers/hwmon/lm90.c
>>>>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>>>>       struct lm90_data {
>>>>>         struct i2c_client *client;
>>>>> +    struct device *hwmon_dev;
>>>>>         u32 channel_config[4];
>>>>>         struct hwmon_channel_info temp_info;
>>>>>         const struct hwmon_channel_info *info[3];
>>>>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client
>>>>> *client, u16 *status)
>>>>>           if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH |
>>>>> LM90_STATUS_LTHRM)) ||
>>>>>             (st2 & MAX6696_STATUS2_LOT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 1);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 1);
>>>>>         if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH |
>>>>> LM90_STATUS_RTHRM)) ||
>>>>>             (st2 & MAX6696_STATUS2_ROT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 2);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 2);
>>>>>         if (st & LM90_STATUS_ROPEN)
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d diode open, please check!\n", 2);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d diode open, please check!\n", 2);
>>>>>         if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>>>>                MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d out of range, please check!\n", 3);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d out of range, please check!\n", 3);
>>>>>         if (st2 & MAX6696_STATUS2_R2OPEN)
>>>>> -        dev_warn(&client->dev,
>>>>> -             "temp%d diode open, please check!\n", 3);
>>>>> +        dev_dbg(&client->dev,
>>>>> +            "temp%d diode open, please check!\n", 3);
>>>>> +
>>>>> +    if (st & LM90_STATUS_LLOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 0);
>>>>> +    if (st & LM90_STATUS_RLOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 1);
>>>>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_min, 2);
>>>>> +    if (st & LM90_STATUS_LHIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 0);
>>>>> +    if (st & LM90_STATUS_RHIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 1);
>>>>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>>>>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>>>>> +                   hwmon_temp_max, 2);
>>>>
>>>>
>>>> We observed a random NULL pointer dereference crash somewhere in the
>>>> thermal core (crash log below is not very helpful) when calling
>>>> mutex_lock(). It looks like we get an interrupt when this crash
>>>> happens.
>>>>
>>>> Looking at the lm90 driver, per the above, I now see we are calling
>>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>>> hwmon_notify_event() I see that ...
>>>>
>>>> hwmon_notify_event()
>>>>     --> hwmon_thermal_notify()
>>>>       --> thermal_zone_device_update()
>>>>         --> update_temperature()
>>>>           --> mutex_lock()
>>>>
>>>> So although I don't completely understand the crash, it does seem
>>>> that we should not be calling hwmon_notify_event() from the
>>>> interrupt handler.
>>>>
>>>> BTW I have not reproduced this myself yet, so I have just been
>>>> reviewing the code to try and understand this.
>>>
>>> Matt Merhar was experiencing a similar issue on T30 Ouya, but I never
>>> managed to reproduce it on Nexus 7 and Acer A500 tablets, and couldn't
>>> spot any problem in the code. IIRC, it was a NULL dereference of another
>>> pointer within that code.
>>
>>
>> OK. From looking at the above I don't think we can call
>> hwmon_notify_event() from an interrupt handler because this is going to
>> try and request a mutex. So we need to fix that.
> 
> The interrupt is threaded, so it can take a mutex.

Exactly. The problem is elsewhere.

Guenter


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 12:01   ` Jon Hunter
  2022-02-21 12:36     ` Dmitry Osipenko
@ 2022-02-21 15:43     ` Guenter Roeck
  2022-02-21 15:49       ` Jon Hunter
  1 sibling, 1 reply; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 15:43 UTC (permalink / raw)
  To: Jon Hunter, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 04:01, Jon Hunter wrote:
> Hi Dmitry,
> 
> On 18/06/2021 22:54, Dmitry Osipenko wrote:
>> Use hwmon_notify_event() to notify userspace and thermal core about
>> temperature changes.
>>
>> Suggested-by: Guenter Roeck <linux@roeck-us.net>
>> Signed-off-by: Dmitry Osipenko <digetx@gmail.com>
>> ---
>>   drivers/hwmon/lm90.c | 44 +++++++++++++++++++++++++++++++++-----------
>>   1 file changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/drivers/hwmon/lm90.c b/drivers/hwmon/lm90.c
>> index 2e057fad05b4..e7b678a40b39 100644
>> --- a/drivers/hwmon/lm90.c
>> +++ b/drivers/hwmon/lm90.c
>> @@ -465,6 +465,7 @@ enum lm90_temp11_reg_index {
>>   struct lm90_data {
>>       struct i2c_client *client;
>> +    struct device *hwmon_dev;
>>       u32 channel_config[4];
>>       struct hwmon_channel_info temp_info;
>>       const struct hwmon_channel_info *info[3];
>> @@ -1731,22 +1732,41 @@ static bool lm90_is_tripped(struct i2c_client *client, u16 *status)
>>       if ((st & (LM90_STATUS_LLOW | LM90_STATUS_LHIGH | LM90_STATUS_LTHRM)) ||
>>           (st2 & MAX6696_STATUS2_LOT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 1);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 1);
>>       if ((st & (LM90_STATUS_RLOW | LM90_STATUS_RHIGH | LM90_STATUS_RTHRM)) ||
>>           (st2 & MAX6696_STATUS2_ROT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 2);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 2);
>>       if (st & LM90_STATUS_ROPEN)
>> -        dev_warn(&client->dev,
>> -             "temp%d diode open, please check!\n", 2);
>> +        dev_dbg(&client->dev,
>> +            "temp%d diode open, please check!\n", 2);
>>       if (st2 & (MAX6696_STATUS2_R2LOW | MAX6696_STATUS2_R2HIGH |
>>              MAX6696_STATUS2_R2THRM | MAX6696_STATUS2_R2OT2))
>> -        dev_warn(&client->dev,
>> -             "temp%d out of range, please check!\n", 3);
>> +        dev_dbg(&client->dev,
>> +            "temp%d out of range, please check!\n", 3);
>>       if (st2 & MAX6696_STATUS2_R2OPEN)
>> -        dev_warn(&client->dev,
>> -             "temp%d diode open, please check!\n", 3);
>> +        dev_dbg(&client->dev,
>> +            "temp%d diode open, please check!\n", 3);
>> +
>> +    if (st & LM90_STATUS_LLOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 0);
>> +    if (st & LM90_STATUS_RLOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 1);
>> +    if (st2 & MAX6696_STATUS2_R2LOW)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_min, 2);
>> +    if (st & LM90_STATUS_LHIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 0);
>> +    if (st & LM90_STATUS_RHIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 1);
>> +    if (st2 & MAX6696_STATUS2_R2HIGH)
>> +        hwmon_notify_event(data->hwmon_dev, hwmon_temp,
>> +                   hwmon_temp_max, 2);
> 
> 
> We observed a random null pointer dereference crash somewhere in the
> thermal core (crash log below is not very helpful) when calling
> mutex_lock(). It looks like we get an interrupt when this crash
> happens.
> 
> Looking at the lm90 driver, per the above, I now see we are calling
> hwmon_notify_event() from the lm90 interrupt handler. Looking at
> hwmon_notify_event() I see that ...
> 
> hwmon_notify_event()
>    --> hwmon_thermal_notify()
>      --> thermal_zone_device_update()
>        --> update_temperature()
>          --> mutex_lock()
> 
> So although I don't completely understand the crash, it does seem
> that we should not be calling hwmon_notify_event() from the
> interrupt handler.
> 
As mentioned separately, this is not the problem.

I think the problem may be that this is not a devicetree system
(or the lm90 device does not have a devicetree node), but thermal
notification currently only works in such systems because the hwmon
subsystem uses the devicetree registration method. At the same time,
CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
does not bail out in that situation due to another bug.

I'll revert the related patches. This will have to be sorted out
separately in a later kernel release.

Thanks,
Guenter

> BTW I have not reproduced this myself yet, so I have just been
> reviewing the code to try and understand this.
> 
> Jon
> 
> [ 7465.595066] Unable to handle kernel NULL pointer dereference at virtual address 00000000000003cd
> [ 7465.596619] Mem abort info:
> [ 7465.597854]   ESR = 0x96000021
> [ 7465.599097]   EC = 0x25: DABT (current EL), IL = 32 bits
> [ 7465.600338]   SET = 0, FnV = 0
> [ 7465.601526]   EA = 0, S1PTW = 0
> [ 7465.602705]   FSC = 0x21: alignment fault
> [ 7465.603885] Data abort info:
> [ 7465.605017]   ISV = 0, ISS = 0x00000021
> [ 7465.606171]   CM = 0, WnR = 0
> [ 7465.607301] user pgtable: 64k pages, 48-bit VAs, pgdp=00000001041f1800
> [ 7465.608490] [00000000000003cd] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
> [ 7465.609814] Internal error: Oops: 96000021 [#1] PREEMPT SMP
> [ 7465.610991] Modules linked in: bridge stp llc snd_soc_tegra210_admaif snd_soc_tegra_pcm snd_soc_tegra210_mixer snd_soc_tegra210_mvc snd_soc_tegra210_i2s snd_soc_tegra210_dmic snd_soc_tegra210_adx snd_soc_tegra210_sfc snd_soc_tegra210_amx snd_soc_tegra210_ahub tegra210_adma snd_soc_rt5659 snd_soc_rl6231 pwm_tegra tegra_aconnect snd_hda_codec_hdmi rfkill snd_hda_tegra snd_hda_codec at24 phy_tegra194_p2u snd_hda_core lm90 snd_soc_tegra_audio_graph_card tegra_bpmp_thermal snd_soc_audio_graph_card snd_soc_simple_card_utils pwm_fan crct10dif_ce pcie_tegra194 cec drm_kms_helper drm ip_tables x_tables ipv6
> [ 7465.632232] CPU: 2 PID: 433 Comm: irq/140-lm90 Tainted: G           O      5.16.0-tegra-g9d109504d83a #1
> [ 7465.636285] Hardware name: Unknown NVIDIA Jetson AGX Xavier Developer Kit/NVIDIA Jetson AGX Xavier Developer Kit, BIOS v1.1.2-901d3c52 02/07/2022
> [ 7465.650457] pstate: 60400009 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> [ 7465.656210] pc : mutex_lock+0x18/0x60
> [ 7465.660134] lr : thermal_zone_device_update+0x40/0x2e0
> [ 7465.665117] sp : ffff800014c4fc60
> [ 7465.668781] x29: ffff800014c4fc60 x28: ffff365ee3f6e000 x27: ffffdde218426790
> [ 7465.675882] x26: ffff365ee3f6e000 x25: 0000000000000000 x24: ffff365ee3f6e000
> [ 7465.683485] x23: ffffdde218426870 x22: ffff365ee3f6e000 x21: 00000000000003cd
> [ 7465.690816] x20: ffff365ee8bf3308 x19: ffffffffffffffed x18: 0000000000000000
> [ 7465.697982] x17: ffffdde21842689c x16: ffffdde1cb7a0b7c x15: 0000000000000040
> [ 7465.705320] x14: ffffdde21a4889a0 x13: 0000000000000228 x12: 0000000000000000
> [ 7465.712493] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
> [ 7465.719580] x8 : 0000000001120000 x7 : 0000000000000001 x6 : 0000000000000000
> [ 7465.727099] x5 : 0068000878e20f07 x4 : 0000000000000000 x3 : 00000000000003cd
> [ 7465.734348] x2 : ffff365ee3f6e000 x1 : 0000000000000000 x0 : 00000000000003cd
> [ 7465.741347] Call trace:
> [ 7465.744207]  mutex_lock+0x18/0x60
> [ 7465.747427]  hwmon_notify_event+0xfc/0x110
> [ 7465.751358]  0xffffdde1cb7a0a90
> [ 7465.754574]  0xffffdde1cb7a0b7c
> [ 7465.757705]  irq_thread_fn+0x2c/0xa0
> [ 7465.760937]  irq_thread+0x134/0x240
> [ 7465.764850]  kthread+0x178/0x190
> [ 7465.768083]  ret_from_fork+0x10/0x20
> [ 7465.771748] Code: d503201f d503201f d2800001 aa0103e4 (c8e47c02)
> [ 7465.777865] ---[ end trace f0b3723991411538 ]---
> 



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 15:43     ` Guenter Roeck
@ 2022-02-21 15:49       ` Jon Hunter
  2022-02-21 16:02         ` Guenter Roeck
  0 siblings, 1 reply; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 15:49 UTC (permalink / raw)
  To: Guenter Roeck, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra


On 21/02/2022 15:43, Guenter Roeck wrote:

...

>> We observed a random null pointer dereference crash somewhere in the
>> thermal core (crash log below is not very helpful) when calling
>> mutex_lock(). It looks like we get an interrupt when this crash
>> happens.
>>
>> Looking at the lm90 driver, per the above, I now see we are calling
>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>> hwmon_notify_event() I see that ...
>>
>> hwmon_notify_event()
>>    --> hwmon_thermal_notify()
>>      --> thermal_zone_device_update()
>>        --> update_temperature()
>>          --> mutex_lock()
>>
>> So although I don't completely understand the crash, it does seem
>> that we should not be calling hwmon_notify_event() from the
>> interrupt handler.
>>
> As mentioned separately, this is not the problem.

Yes I can see that now.

> I think the problem may be that this is not a devicetree system
> (or the lm90 device does not have a devicetree node), but thermal
> notification currently only works in such systems because the hwmon
> subsystem uses the devicetree registration method. At the same time,
> CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
> does not bail out in that situation due to another bug.

The platform I see this on does use device-tree and it does have a node 
for the ti,tmp451 device which uses the lm90 device. This platform uses 
the device-tree source 
arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node 
is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.

Cheers
Jon

-- 
nvpublic


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 15:49       ` Jon Hunter
@ 2022-02-21 16:02         ` Guenter Roeck
  2022-02-21 16:13           ` Dmitry Osipenko
  2022-02-21 16:16           ` Jon Hunter
  0 siblings, 2 replies; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 16:02 UTC (permalink / raw)
  To: Jon Hunter, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 07:49, Jon Hunter wrote:
> 
> On 21/02/2022 15:43, Guenter Roeck wrote:
> 
> ...
> 
>>> We observed a random null pointer dereference crash somewhere in the
>>> thermal core (crash log below is not very helpful) when calling
>>> mutex_lock(). It looks like we get an interrupt when this crash
>>> happens.
>>>
>>> Looking at the lm90 driver, per the above, I now see we are calling
>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>> hwmon_notify_event() I see that ...
>>>
>>> hwmon_notify_event()
>>>    --> hwmon_thermal_notify()
>>>      --> thermal_zone_device_update()
>>>        --> update_temperature()
>>>          --> mutex_lock()
>>>
>>> So although I don't completely understand the crash, it does seem
>>> that we should not be calling hwmon_notify_event() from the
>>> interrupt handler.
>>>
>> As mentioned separately, this is not the problem.
> 
> Yes I can see that now.
> 
>> I think the problem may be that this is not a devicetree system
>> (or the lm90 device does not have a devicetree node), but thermal
>> notification currently only works in such systems because the hwmon
>> subsystem uses the devicetree registration method. At the same time,
>> CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
>> does not bail out in that situation due to another bug.
> 
> The platform I see this on does use device-tree and it does have a node for the ti,tmp451 device which uses the lm90 device. This platform uses the device-tree source arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
> 

Interesting. It appears that the call to devm_thermal_zone_of_sensor_register()
in the hwmon core nevertheless returns -ENODEV which is not handled properly
in the hwmon core. I can see a number of reasons for this to happen:
- there is no devicetree node for the lm90 device
- there is no thermal-zones devicetree node
- there is no thermal zone entry in the thermal-zones node which matches
   the sensor

We'll have to revert the lm90 changes until this is sorted out.

Guenter


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:02         ` Guenter Roeck
@ 2022-02-21 16:13           ` Dmitry Osipenko
  2022-02-21 16:16           ` Jon Hunter
  1 sibling, 0 replies; 23+ messages in thread
From: Dmitry Osipenko @ 2022-02-21 16:13 UTC (permalink / raw)
  To: Guenter Roeck, Jon Hunter, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

21.02.2022 19:02, Guenter Roeck wrote:
> On 2/21/22 07:49, Jon Hunter wrote:
>>
>> On 21/02/2022 15:43, Guenter Roeck wrote:
>>
>> ...
>>
>>>> We observed a random null pointer dereference crash somewhere in the
>>>> thermal core (crash log below is not very helpful) when calling
>>>> mutex_lock(). It looks like we get an interrupt when this crash
>>>> happens.
>>>>
>>>> Looking at the lm90 driver, per the above, I now see we are calling
>>>> hwmon_notify_event() from the lm90 interrupt handler. Looking at
>>>> hwmon_notify_event() I see that ...
>>>>
>>>> hwmon_notify_event()
>>>>    --> hwmon_thermal_notify()
>>>>      --> thermal_zone_device_update()
>>>>        --> update_temperature()
>>>>          --> mutex_lock()
>>>>
>>>> So although I don't completely understand the crash, it does seem
>>>> that we should not be calling hwmon_notify_event() from the
>>>> interrupt handler.
>>>>
>>> As mentioned separately, this is not the problem.
>>
>> Yes I can see that now.
>>
>>> I think the problem may be that this is not a devicetree system
>>> (or the lm90 device does not have a devicetree node), but thermal
>>> notification currently only works in such systems because the hwmon
>>> subsystem uses the devicetree registration method. At the same time,
>>> CONFIG_THERMAL_OF is obviously enabled. Unfortunately, the hwmon code
>>> does not bail out in that situation due to another bug.
>>
>> The platform I see this on does use device-tree and it does have a
>> node for the ti,tmp451 device which uses the lm90 device. This
>> platform uses the device-tree source
>> arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node
>> is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>
> 
> Interesting. It appears that the call to
> devm_thermal_zone_of_sensor_register()
> in the hwmon core nevertheless returns -ENODEV which is not handled
> properly
> in the hwmon core. I can see a number of reasons for this to happen:
> - there is no devicetree node for the lm90 device
> - there is no thermal-zones devicetree node
> - there is no thermal zone entry in the thermal-zones node which matches
>   the sensor
> 
> We'll have to revert the lm90 changes until this is sorted out.

Oh, yeah. Seems there is a problem there and tzd pointer could be
-ENODEV. But it's a hwmon core problem, which apparently existed for a
long time, not the lm90 problem.
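
For reference, the register dump in the oops quoted earlier looks consistent with this. x19 holds 0xffffffffffffffed, which is -19 (-ENODEV) when read as a signed value, and the faulting address 0x3cd is that value plus 0x3e0. Below is a minimal userspace sketch of that arithmetic. It assumes x19 held the tzd pointer and that 0x3e0 is the offset of the mutex inside the zone structure for that particular build; both are inferences from the log, not something confirmed in this thread. The helper names are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Values copied from the oops quoted earlier in the thread. */
#define OOPS_X19        UINT64_C(0xffffffffffffffed)
#define OOPS_FAULT_ADDR UINT64_C(0x00000000000003cd)

/* Reinterpret a 64-bit register value as a signed number (two's complement). */
int64_t reg_as_signed(uint64_t reg)
{
	return (int64_t)reg;
}

/*
 * Assumption: the faulting address is "base pointer + member offset".
 * Unsigned wraparound recovers the implied member offset even when the
 * base is really an encoded errno rather than a valid pointer.
 */
uint64_t implied_member_offset(uint64_t base, uint64_t fault)
{
	return fault - base;
}

/* Self-check tying the two observations together. */
void oops_selfcheck(void)
{
	/* x19 decodes to -ENODEV (ENODEV is 19 on Linux). */
	assert(reg_as_signed(OOPS_X19) == -ENODEV);
	/* The fault address is 0x3e0 bytes past that bogus "pointer". */
	assert(implied_member_offset(OOPS_X19, OOPS_FAULT_ADDR) == 0x3e0);
}
```

If that reading is right, the reported "NULL pointer dereference" is really a dereference of ERR_PTR(-ENODEV), which would also explain why the fault is flagged as an alignment fault on an odd address.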



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:02         ` Guenter Roeck
  2022-02-21 16:13           ` Dmitry Osipenko
@ 2022-02-21 16:16           ` Jon Hunter
  2022-02-21 16:20             ` Dmitry Osipenko
                               ` (2 more replies)
  1 sibling, 3 replies; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 16:16 UTC (permalink / raw)
  To: Guenter Roeck, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra


On 21/02/2022 16:02, Guenter Roeck wrote:

...

>> The platform I see this on does use device-tree and it does have a 
>> node for the ti,tmp451 device which uses the lm90 device. This 
>> platform uses the device-tree source 
>> arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node 
>> is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>
> 
> Interesting. It appears that the call to 
> devm_thermal_zone_of_sensor_register()
> in the hwmon core nevertheless returns -ENODEV which is not handled 
> properly
> in the hwmon core. I can see a number of reasons for this to happen:
> - there is no devicetree node for the lm90 device
> - there is no thermal-zones devicetree node
> - there is no thermal zone entry in the thermal-zones node which matches
>    the sensor


So we definitely have the node for the lm90 device and a thermal-zones 
node, but I do not see a thermal-sensor node. Maybe this is what we are 
missing?

Jon

-- 
nvpublic


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:16           ` Jon Hunter
@ 2022-02-21 16:20             ` Dmitry Osipenko
  2022-02-21 16:42               ` Guenter Roeck
  2022-02-21 16:22             ` Jon Hunter
  2022-02-21 16:23             ` Guenter Roeck
  2 siblings, 1 reply; 23+ messages in thread
From: Dmitry Osipenko @ 2022-02-21 16:20 UTC (permalink / raw)
  To: Jon Hunter, Guenter Roeck, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

21.02.2022 19:16, Jon Hunter wrote:
> 
> On 21/02/2022 16:02, Guenter Roeck wrote:
> 
> ...
> 
>>> The platform I see this on does use device-tree and it does have a
>>> node for the ti,tmp451 device which uses the lm90 device. This
>>> platform uses the device-tree source
>>> arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451
>>> node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>>
>>
>> Interesting. It appears that the call to
>> devm_thermal_zone_of_sensor_register()
>> in the hwmon core nevertheless returns -ENODEV which is not handled
>> properly
>> in the hwmon core. I can see a number of reasons for this to happen:
>> - there is no devicetree node for the lm90 device
>> - there is no thermal-zones devicetree node
>> - there is no thermal zone entry in the thermal-zones node which matches
>>    the sensor
> 
> 
> So we definitely have the node for the lm90 device and a thermal-zones
> node, but I do not see a thermal-sensor node. Maybe this is what we are
> missing?

Could you please try this:

diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
index 5915fedee69b..48f80bc99fe6 100644
--- a/drivers/hwmon/hwmon.c
+++ b/drivers/hwmon/hwmon.c
@@ -233,8 +233,12 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
 	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
 	 * so ignore that error but forward any other error.
 	 */
-	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
-		return PTR_ERR(tzd);
+	if (IS_ERR(tzd)) {
+		if (PTR_ERR(tzd) != -ENODEV)
+			return PTR_ERR(tzd);
+
+		tzd = NULL;
+	}

 	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
 	if (err)
@@ -283,7 +287,7 @@ static void hwmon_thermal_notify(struct device *dev, int index)
 	struct hwmon_thermal_data *tzdata;

 	list_for_each_entry(tzdata, &hwdev->tzdata, node) {
-		if (tzdata->index == index) {
+		if (tzdata->index == index && tzdata->tzd) {
 			thermal_zone_device_update(tzdata->tzd,
 						   THERMAL_EVENT_UNSPECIFIED);
 		}



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:16           ` Jon Hunter
  2022-02-21 16:20             ` Dmitry Osipenko
@ 2022-02-21 16:22             ` Jon Hunter
  2022-02-21 18:38               ` Guenter Roeck
  2022-02-21 16:23             ` Guenter Roeck
  2 siblings, 1 reply; 23+ messages in thread
From: Jon Hunter @ 2022-02-21 16:22 UTC (permalink / raw)
  To: Guenter Roeck, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra


On 21/02/2022 16:16, Jon Hunter wrote:
> 
> On 21/02/2022 16:02, Guenter Roeck wrote:
> 
> ...
> 
>>> The platform I see this on does use device-tree and it does have a 
>>> node for the ti,tmp451 device which uses the lm90 device. This 
>>> platform uses the device-tree source 
>>> arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 
>>> node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>>
>>
>> Interesting. It appears that the call to 
>> devm_thermal_zone_of_sensor_register()
>> in the hwmon core nevertheless returns -ENODEV which is not handled 
>> properly
>> in the hwmon core. I can see a number of reasons for this to happen:
>> - there is no devicetree node for the lm90 device
>> - there is no thermal-zones devicetree node
>> - there is no thermal zone entry in the thermal-zones node which matches
>>    the sensor
> 
> 
> So we definitely have the node for the lm90 device and a thermal-zones 
> node, but I do not see a thermal-sensor node. Maybe this is what we are 
> missing?

Actually, that is not true. We do have thermal-sensor nodes in 
arch/arm64/boot/dts/nvidia/tegra194.dtsi.

Jon

-- 
nvpublic


* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:16           ` Jon Hunter
  2022-02-21 16:20             ` Dmitry Osipenko
  2022-02-21 16:22             ` Jon Hunter
@ 2022-02-21 16:23             ` Guenter Roeck
  2 siblings, 0 replies; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 16:23 UTC (permalink / raw)
  To: Jon Hunter, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 08:16, Jon Hunter wrote:
> 
> On 21/02/2022 16:02, Guenter Roeck wrote:
> 
> ...
> 
>>> The platform I see this on does use device-tree and it does have a node for the ti,tmp451 device which uses the lm90 device. This platform uses the device-tree source arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>>
>>
>> Interesting. It appears that the call to devm_thermal_zone_of_sensor_register()
>> in the hwmon core nevertheless returns -ENODEV which is not handled properly
>> in the hwmon core. I can see a number of reasons for this to happen:
>> - there is no devicetree node for the lm90 device
>> - there is no thermal-zones devicetree node
>> - there is no thermal zone entry in the thermal-zones node which matches
>>    the sensor
> 
> 
> So we definitely have the node for the lm90 device and a thermal-zones node, but I do not see a thermal-sensor node. Maybe this is what we are missing?
> 

Correct.

Guenter



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:20             ` Dmitry Osipenko
@ 2022-02-21 16:42               ` Guenter Roeck
  0 siblings, 0 replies; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 16:42 UTC (permalink / raw)
  To: Dmitry Osipenko, Jon Hunter, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 08:20, Dmitry Osipenko wrote:
> 21.02.2022 19:16, Jon Hunter wrote:
>>
>> On 21/02/2022 16:02, Guenter Roeck wrote:
>>
>> ...
>>
>>>> The platform I see this on does use device-tree and it does have a
>>>> node for the ti,tmp451 device which uses the lm90 device. This
>>>> platform uses the device-tree source
>>>> arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451
>>>> node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>>>
>>>
>>> Interesting. It appears that the call to
>>> devm_thermal_zone_of_sensor_register()
>>> in the hwmon core nevertheless returns -ENODEV which is not handled
>>> properly
>>> in the hwmon core. I can see a number of reasons for this to happen:
>>> - there is no devicetree node for the lm90 device
>>> - there is no thermal-zones devicetree node
>>> - there is no thermal zone entry in the thermal-zones node which matches
>>>     the sensor
>>
>>
>> So we definitely have the node for the lm90 device and a thermal-zones
>> node, but I do not see a thermal-sensor node. Maybe this is what we are
>> missing?
> 
> Could you please try this:
> 
> diff --git a/drivers/hwmon/hwmon.c b/drivers/hwmon/hwmon.c
> index 5915fedee69b..48f80bc99fe6 100644
> --- a/drivers/hwmon/hwmon.c
> +++ b/drivers/hwmon/hwmon.c
> @@ -233,8 +233,12 @@ static int hwmon_thermal_add_sensor(struct device *dev, int index)
>   	 * If CONFIG_THERMAL_OF is disabled, this returns -ENODEV,
>   	 * so ignore that error but forward any other error.
>   	 */
> -	if (IS_ERR(tzd) && (PTR_ERR(tzd) != -ENODEV))
> -		return PTR_ERR(tzd);
> +	if (IS_ERR(tzd)) {
> +		if (PTR_ERR(tzd) != -ENODEV)
> +			return PTR_ERR(tzd);
> +
> +		tzd = NULL;

That should just bail out. I'll send a patch in a minute.

Guenter

> +	}
> 
>   	err = devm_add_action(dev, hwmon_thermal_remove_sensor, &tdata->node);
>   	if (err)
> @@ -283,7 +287,7 @@ static void hwmon_thermal_notify(struct device *dev, int index)
>   	struct hwmon_thermal_data *tzdata;
> 
>   	list_for_each_entry(tzdata, &hwdev->tzdata, node) {
> -		if (tzdata->index == index) {
> +		if (tzdata->index == index && tzdata->tzd) {
>   			thermal_zone_device_update(tzdata->tzd,
>   						   THERMAL_EVENT_UNSPECIFIED);
>   		}
> 
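
The difference between the original check and "just bail out" can be illustrated in userspace. The sketch below uses simplified stand-ins for the kernel's ERR_PTR(), PTR_ERR() and IS_ERR() helpers, and the struct and function names are hypothetical. It is not the patch referred to above, which does not appear in this thread; it only shows why the original condition lets an error pointer be stored, and what bailing out avoids.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ERRNO 4095

/* Simplified userspace stand-ins for the kernel's err.h helpers. */
static inline void *ERR_PTR(long error) { return (void *)(intptr_t)error; }
static inline long PTR_ERR(const void *ptr) { return (long)(intptr_t)ptr; }
static inline bool IS_ERR(const void *ptr)
{
	return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

struct zone { int id; };	/* hypothetical stand-in for a thermal zone */
struct sensor {			/* hypothetical stand-in for per-sensor data */
	struct zone *tzd;
	bool registered;
};

/* Original check: -ENODEV is ignored, but the error pointer is still kept. */
int add_sensor_original(struct sensor *s, struct zone *tzd)
{
	if (IS_ERR(tzd) && PTR_ERR(tzd) != -ENODEV)
		return (int)PTR_ERR(tzd);

	s->tzd = tzd;		/* may store ERR_PTR(-ENODEV) */
	s->registered = true;
	return 0;
}

/* Bail-out variant: -ENODEV is still not an error, but nothing is recorded. */
int add_sensor_bail_out(struct sensor *s, struct zone *tzd)
{
	if (IS_ERR(tzd)) {
		if (PTR_ERR(tzd) != -ENODEV)
			return (int)PTR_ERR(tzd);
		return 0;	/* no zone for this sensor, so nothing to notify later */
	}

	s->tzd = tzd;
	s->registered = true;
	return 0;
}
```

With the bail-out variant no entry exists for a sensor without a zone, so a later notification walk has nothing to dereference and needs no extra NULL check.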



* Re: [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event()
  2022-02-21 16:22             ` Jon Hunter
@ 2022-02-21 18:38               ` Guenter Roeck
  0 siblings, 0 replies; 23+ messages in thread
From: Guenter Roeck @ 2022-02-21 18:38 UTC (permalink / raw)
  To: Jon Hunter, Dmitry Osipenko, Jean Delvare
  Cc: linux-kernel, linux-hwmon, linux-tegra

On 2/21/22 08:22, Jon Hunter wrote:
> 
> On 21/02/2022 16:16, Jon Hunter wrote:
>>
>> On 21/02/2022 16:02, Guenter Roeck wrote:
>>
>> ...
>>
>>>> The platform I see this on does use device-tree and it does have a node for the ti,tmp451 device which uses the lm90 device. This platform uses the device-tree source arch/arm64/boot/dts/nvidia/tegra194-p2972-0000.dts and the tmp451 node is in arch/arm64/boot/dts/nvidia/tegra194-p2888.dtsi.
>>>>
>>>
>>> Interesting. It appears that the call to devm_thermal_zone_of_sensor_register()
>>> in the hwmon core nevertheless returns -ENODEV which is not handled properly
>>> in the hwmon core. I can see a number of reasons for this to happen:
>>> - there is no devicetree node for the lm90 device
>>> - there is no thermal-zones devicetree node
>>> - there is no thermal zone entry in the thermal-zones node which matches
>>>    the sensor
>>
>>
>> So we definitely have the node for the lm90 device and a thermal-zones node, but I do not see a thermal-sensor node. Maybe this is what we are missing?
> 
> Actually, that is not true. We do have thermal-sensor nodes in arch/arm64/boot/dts/nvidia/tegra194.dtsi.
> 

There is probably a zone to sensor id mismatch. hwmon sends the sensor index
as sensor_id to the thermal subsystem. Those sensor IDs would be 0, 1, and
possibly 2 for the lm90 driver. Assuming this should match the thermal-sensors
values in arch/arm64/boot/dts/nvidia/tegra194.dtsi, those start with 2,
so there would be a likely mismatch. Also, all those dtsi entries match
against bpmp/thermal, not against the lm90 sensor(s).

Thanks,
Guenter
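
The mismatch described above can be pictured as a failed table lookup. The sketch below is a loose userspace model, not the thermal core's actual matching code: each zone entry names a sensor provider and a sensor ID, and registration succeeds only if both match. The table contents are illustrative placeholders (IDs starting at 2, all pointing at bpmp/thermal, as described above) and are not copied from tegra194.dtsi; the names `zone_ref`, `find_zone` and `example_zones` are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* One thermal zone's reference to a sensor: provider node plus sensor ID. */
struct zone_ref {
	const char *provider;
	int sensor_id;
};

/* Illustrative placeholder table, not the real devicetree contents. */
const struct zone_ref example_zones[] = {
	{ "bpmp/thermal", 2 },
	{ "bpmp/thermal", 3 },
	{ "bpmp/thermal", 4 },
};

/* Return the index of the matching zone, or -ENODEV if none matches. */
int find_zone(const struct zone_ref *zones, size_t n,
	      const char *provider, int sensor_id)
{
	for (size_t i = 0; i < n; i++) {
		if (strcmp(zones[i].provider, provider) == 0 &&
		    zones[i].sensor_id == sensor_id)
			return (int)i;
	}
	return -ENODEV;
}

/* The lm90 channels (IDs 0, 1, 2) find no zone in this table. */
void lookup_selfcheck(void)
{
	for (int id = 0; id <= 2; id++)
		assert(find_zone(example_zones, 3, "lm90", id) == -ENODEV);
}
```

In this model even ID 2 fails for the lm90, because the provider differs, so every channel ends up with -ENODEV.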


end of thread, other threads:[~2022-02-21 18:40 UTC | newest]

Thread overview: 23+ messages
2021-06-18 21:54 [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Dmitry Osipenko
2021-06-18 21:54 ` [PATCH v3 1/4] hwmon: (lm90) Don't override interrupt trigger type Dmitry Osipenko
2021-06-18 21:54 ` [PATCH v3 2/4] hwmon: (lm90) Use hwmon_notify_event() Dmitry Osipenko
2022-02-21 12:01   ` Jon Hunter
2022-02-21 12:36     ` Dmitry Osipenko
2022-02-21 12:56       ` Jon Hunter
2022-02-21 12:59         ` Dmitry Osipenko
2022-02-21 13:50           ` Jon Hunter
2022-02-21 13:59             ` Dmitry Osipenko
2022-02-21 15:25           ` Guenter Roeck
2022-02-21 15:43     ` Guenter Roeck
2022-02-21 15:49       ` Jon Hunter
2022-02-21 16:02         ` Guenter Roeck
2022-02-21 16:13           ` Dmitry Osipenko
2022-02-21 16:16           ` Jon Hunter
2022-02-21 16:20             ` Dmitry Osipenko
2022-02-21 16:42               ` Guenter Roeck
2022-02-21 16:22             ` Jon Hunter
2022-02-21 18:38               ` Guenter Roeck
2022-02-21 16:23             ` Guenter Roeck
2021-06-18 21:54 ` [PATCH v3 3/4] hwmon: (lm90) Unmask hardware interrupt Dmitry Osipenko
2021-06-18 21:54 ` [PATCH v3 4/4] hwmon: (lm90) Disable interrupt on suspend Dmitry Osipenko
2021-06-19 11:10 ` [PATCH v3 0/4] HWMON LM90 interrupt fixes and improvements Guenter Roeck
