All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
       [not found] <f5990ef1-4efe-d2b1-8e50-c6890526c054@gmail.com>
@ 2022-08-09 20:56 ` Zev Weiss
  2022-08-09 21:28   ` Zoltán Kővágó
  0 siblings, 1 reply; 14+ messages in thread
From: Zev Weiss @ 2022-08-09 20:56 UTC (permalink / raw)
  To: Zoltán Kővágó; +Cc: Guenter Roeck, linux-hwmon

On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>Hi,
>
>[1.] One line summary of the problem:
>NCT6775: suspend doesn't work after updating to Linux 5.19
>
>[2.] Full description of the problem/report:
>After updating my kernel from 5.18.11 to 5.19, I've noticed that 
>resuming after suspend no longer works: fans start up, then about a 
>second later, the computer just shuts down, leaving the USB ports 
>powered up (normally it turns them off on shutdown). The screens don't 
>turn on during this timeframe, so I can't see any useful log messages.
>
>Bisecting between 5.18 (where it still worked) and 5.19 lead me to 
>commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) 
>Split core and platform driver" which looks like a refactor commit, 
>but apparently it broke something.
>

Hi Zoltán,

Thanks for the thorough bug report.  You're right that that commit was 
essentially just a refactor, though there was one slight change to the 
nct6775_suspend() function introduced during the review process that 
may perhaps have had some subtle unintended side-effects.

Can you test the following patch and report if it resolves the problem?


Thanks,
Zev

diff --git a/drivers/hwmon/nct6775-core.c b/drivers/hwmon/nct6775-core.c
index 446964cbae4c..da9ec6983e13 100644
--- a/drivers/hwmon/nct6775-core.c
+++ b/drivers/hwmon/nct6775-core.c
@@ -1480,7 +1480,7 @@ static int nct6775_update_pwm_limits(struct device *dev)
  	return 0;
  }
  
-static struct nct6775_data *nct6775_update_device(struct device *dev)
+struct nct6775_data *nct6775_update_device(struct device *dev)
  {
  	struct nct6775_data *data = dev_get_drvdata(dev);
  	int i, j, err = 0;
@@ -1615,6 +1615,7 @@ static struct nct6775_data *nct6775_update_device(struct device *dev)
  	mutex_unlock(&data->update_lock);
  	return err ? ERR_PTR(err) : data;
  }
+EXPORT_SYMBOL_GPL(nct6775_update_device);
  
  /*
   * Sysfs callback functions
diff --git a/drivers/hwmon/nct6775-platform.c b/drivers/hwmon/nct6775-platform.c
index ab30437221ce..41c97cfacfb8 100644
--- a/drivers/hwmon/nct6775-platform.c
+++ b/drivers/hwmon/nct6775-platform.c
@@ -359,7 +359,7 @@ static int __maybe_unused nct6775_suspend(struct device *dev)
  {
  	int err;
  	u16 tmp;
-	struct nct6775_data *data = dev_get_drvdata(dev);
+	struct nct6775_data *data = nct6775_update_device(dev);
  
  	if (IS_ERR(data))
  		return PTR_ERR(data);
diff --git a/drivers/hwmon/nct6775.h b/drivers/hwmon/nct6775.h
index 93f708148e65..be41848c3cd2 100644
--- a/drivers/hwmon/nct6775.h
+++ b/drivers/hwmon/nct6775.h
@@ -196,6 +196,8 @@ static inline int nct6775_write_value(struct nct6775_data *data, u16 reg, u16 va
  	return regmap_write(data->regmap, reg, value);
  }
  
+struct nct6775_data *nct6775_update_device(struct device *dev);
+
  bool nct6775_reg_is_word_sized(struct nct6775_data *data, u16 reg);
  int nct6775_probe(struct device *dev, struct nct6775_data *data,
  		  const struct regmap_config *regmapcfg);


^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-09 20:56 ` PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19 Zev Weiss
@ 2022-08-09 21:28   ` Zoltán Kővágó
  2022-08-09 22:34     ` Zev Weiss
  0 siblings, 1 reply; 14+ messages in thread
From: Zoltán Kővágó @ 2022-08-09 21:28 UTC (permalink / raw)
  To: Zev Weiss; +Cc: Guenter Roeck, linux-hwmon

On 2022-08-09 22:56, Zev Weiss wrote:
> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>> Hi,
>>
>> [1.] One line summary of the problem:
>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>
>> [2.] Full description of the problem/report:
>> After updating my kernel from 5.18.11 to 5.19, I've noticed that 
>> resuming after suspend no longer works: fans start up, then about a 
>> second later, the computer just shuts down, leaving the USB ports 
>> powered up (normally it turns them off on shutdown). The screens don't 
>> turn on during this timeframe, so I can't see any useful log messages.
>>
>> Bisecting between 5.18 (where it still worked) and 5.19 lead me to 
>> commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) 
>> Split core and platform driver" which looks like a refactor commit, 
>> but apparently it broke something.
>>
> 
> Hi Zoltán,
> 
> Thanks for the thorough bug report.  You're right that that commit was 
> essentially just a refactor, though there was one slight change to the 
> nct6775_suspend() function introduced during the review process that may 
> perhaps have had some subtle unintended side-effects.
> 
> Can you test the following patch and report if it resolves the problem?
> 
> 
> Thanks,
> Zev

Hi Zev,

Thanks for the quick reply. Yes, it looks like your patch does solve the 
problem (I've applied it on top of 5.19 (after fighting with my mail 
client for a while) and suspended a few times, it's working so far).

Regards,
Zoltan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-09 21:28   ` Zoltán Kővágó
@ 2022-08-09 22:34     ` Zev Weiss
  2022-08-09 23:50       ` Zoltán Kővágó
  2022-08-10  0:03       ` Guenter Roeck
  0 siblings, 2 replies; 14+ messages in thread
From: Zev Weiss @ 2022-08-09 22:34 UTC (permalink / raw)
  To: Zoltán Kővágó; +Cc: Guenter Roeck, linux-hwmon

On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>On 2022-08-09 22:56, Zev Weiss wrote:
>>On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>Hi,
>>>
>>>[1.] One line summary of the problem:
>>>NCT6775: suspend doesn't work after updating to Linux 5.19
>>>
>>>[2.] Full description of the problem/report:
>>>After updating my kernel from 5.18.11 to 5.19, I've noticed that 
>>>resuming after suspend no longer works: fans start up, then about 
>>>a second later, the computer just shuts down, leaving the USB 
>>>ports powered up (normally it turns them off on shutdown). The 
>>>screens don't turn on during this timeframe, so I can't see any 
>>>useful log messages.
>>>
>>>Bisecting between 5.18 (where it still worked) and 5.19 lead me to 
>>>commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) 
>>>Split core and platform driver" which looks like a refactor 
>>>commit, but apparently it broke something.
>>>
>>
>>Hi Zoltán,
>>
>>Thanks for the thorough bug report.  You're right that that commit 
>>was essentially just a refactor, though there was one slight change 
>>to the nct6775_suspend() function introduced during the review 
>>process that may perhaps have had some subtle unintended 
>>side-effects.
>>
>>Can you test the following patch and report if it resolves the problem?
>>
>>
>>Thanks,
>>Zev
>
>Hi Zev,
>
>Thanks for the quick reply. Yes, it looks like your patch does solve 
>the problem (I've applied it on top of 5.19 (after fighting with my 
>mail client for a while) and suspended a few times, it's working so 
>far).
>
>Regards,
>Zoltan

Great, thanks.

Guenter, it looks like nct6775_suspend() really does in fact need to use 
nct6775_update_device() instead of dev_get_drvdata(), though it's not 
immediately obvious to me why.  Though given that the bulk of of the 
body of nct6775_update_device() is inside an 'if' block that might not 
necessarily execute every time, I also wonder if it might be vulnerable 
to exhibiting the same problem depending on timing.

Zoltán, if you could try another experiment to try to gather some data 
on that -- with the patch from my previous email still applied, could 
you try suspending via:

     $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && echo mem > /sys/power/state

and see if suspend is broken again?  The idea being to read from the 
hwmon device's sensors and then immediately suspend so that the 
nct6775_update_device() call from nct6775_suspend() falls within the 
driver's 1.5 second cache window and hence skips most of the work that 
function does.  If the same problem starts occurring again that way, it 
seems we've had essentially the same bug lurking for a while and the 
change in the patch you bisected to just made it happen consistently 
instead of unpredictably.  If suspend/resume continues working however, 
then I'm quite mystified, because the only other thing that's happening 
in that function is a mutex lock/unlock.


Thanks,
Zev


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-09 22:34     ` Zev Weiss
@ 2022-08-09 23:50       ` Zoltán Kővágó
  2022-08-10  1:39         ` Zev Weiss
  2022-08-10  0:03       ` Guenter Roeck
  1 sibling, 1 reply; 14+ messages in thread
From: Zoltán Kővágó @ 2022-08-09 23:50 UTC (permalink / raw)
  To: Zev Weiss; +Cc: Guenter Roeck, linux-hwmon

[-- Attachment #1: Type: text/plain, Size: 2711 bytes --]

On 2022-08-10 00:34, Zev Weiss wrote:
> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>> On 2022-08-09 22:56, Zev Weiss wrote:
>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>> Hi,
>>>>
>>>> [1.] One line summary of the problem:
>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>
>>>> [2.] Full description of the problem/report:
>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that 
>>>> resuming after suspend no longer works: fans start up, then about a 
>>>> second later, the computer just shuts down, leaving the USB ports 
>>>> powered up (normally it turns them off on shutdown). The screens 
>>>> don't turn on during this timeframe, so I can't see any useful log 
>>>> messages.
>>>>
>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me to 
>>>> commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) 
>>>> Split core and platform driver" which looks like a refactor commit, 
>>>> but apparently it broke something.
>>>>
>>>
>>> Hi Zoltán,
>>>
>>> Thanks for the thorough bug report.  You're right that that commit 
>>> was essentially just a refactor, though there was one slight change 
>>> to the nct6775_suspend() function introduced during the review 
>>> process that may perhaps have had some subtle unintended side-effects.
>>>
>>> Can you test the following patch and report if it resolves the problem?
>>>
>>>
>>> Thanks,
>>> Zev
>>
>> Hi Zev,
>>
>> Thanks for the quick reply. Yes, it looks like your patch does solve 
>> the problem (I've applied it on top of 5.19 (after fighting with my 
>> mail client for a while) and suspended a few times, it's working so far).
>>
>> Regards,
>> Zoltan
> 
> Great, thanks.
> 
> Guenter, it looks like nct6775_suspend() really does in fact need to use 
> nct6775_update_device() instead of dev_get_drvdata(), though it's not 
> immediately obvious to me why.  Though given that the bulk of of the 
> body of nct6775_update_device() is inside an 'if' block that might not 
> necessarily execute every time, I also wonder if it might be vulnerable 
> to exhibiting the same problem depending on timing.
> 
> Zoltán, if you could try another experiment to try to gather some data 
> on that -- with the patch from my previous email still applied, could 
> you try suspending via:
> 
>      $ cat 
> /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && echo 
> mem > /sys/power/state

Tried it, three times in fact, and it worked fine every time. Looking at 
the dmesg, though, it looks like it needs a bit more than 1.5 sec to 
suspend. Where's that 1.5 sec limit defined? I will try to increase it 
tomorrow.

Regards,
Zoltan

[-- Attachment #2: dmesg.log --]
[-- Type: text/x-log, Size: 14281 bytes --]

[ 4248.840743] PM: suspend entry (deep)
[ 4248.870251] Filesystems sync: 0.029 seconds
[ 4248.870305] Loading firmware: rtl_nic/rtl8125b-2.fw
[ 4248.870320] Loading firmware: amdgpu/polaris10_mc.bin
[ 4248.870324] Loading firmware: amdgpu/polaris10_pfp_2.bin
[ 4248.870324] Loading firmware: amdgpu/polaris10_me_2.bin
[ 4248.870332] Loading firmware: amdgpu/polaris10_rlc.bin
[ 4248.870331] Loading firmware: amdgpu/polaris10_ce_2.bin
[ 4248.870332] Loading firmware: amdgpu/polaris10_mec_2.bin
[ 4248.870334] Loading firmware: amdgpu/polaris10_mec2_2.bin
[ 4248.870337] Loading firmware: amdgpu/polaris10_sdma.bin
[ 4248.870338] Loading firmware: amdgpu/polaris10_sdma1.bin
[ 4248.870339] Loading firmware: amdgpu/polaris10_uvd.bin
[ 4248.870340] Loading firmware: amdgpu/polaris10_k_smc.bin
[ 4248.870341] Loading firmware: amdgpu/polaris10_vce.bin
[ 4248.870342] Loading firmware: amdgpu/navy_flounder_sos.bin
[ 4248.870343] Loading firmware: amdgpu/navy_flounder_ta.bin
[ 4248.870344] Loading firmware: amdgpu/navy_flounder_smc.bin
[ 4248.870346] Loading firmware: amdgpu/navy_flounder_dmcub.bin
[ 4248.870347] Loading firmware: amdgpu/navy_flounder_pfp.bin
[ 4248.870348] Loading firmware: amdgpu/navy_flounder_me.bin
[ 4248.870348] Loading firmware: amdgpu/navy_flounder_ce.bin
[ 4248.870349] Loading firmware: amdgpu/navy_flounder_rlc.bin
[ 4248.870350] Loading firmware: amdgpu/navy_flounder_mec.bin
[ 4248.870350] Loading firmware: amdgpu/navy_flounder_sdma.bin
[ 4248.870352] Loading firmware: amdgpu/navy_flounder_vcn.bin
[ 4248.870351] Loading firmware: amdgpu/navy_flounder_mec2.bin
[ 4248.977483] Freezing user space processes ... (elapsed 0.001 seconds) done.
[ 4248.978766] OOM killer disabled.
[ 4248.978768] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
[ 4248.979919] printk: Suspending console(s) (use no_console_suspend to debug)
[ 4248.980771] r8169 0000:03:00.0 eth0: Link is Down
[ 4248.990975] sd 1:0:0:0: [sda] Synchronizing SCSI cache
[ 4248.991061] sd 1:0:0:0: [sda] Stopping disk
[ 4249.088481] [drm] free PSP TMR buffer
[ 4249.406971] amdgpu 0000:08:00.0: amdgpu: PCI CONFIG reset
[ 4249.408318] amdgpu 0000:0b:00.0: amdgpu: MODE1 reset
[ 4249.408322] amdgpu 0000:0b:00.0: amdgpu: GPU mode1 reset
[ 4249.408865] amdgpu 0000:0b:00.0: amdgpu: GPU smu mode1 reset
[ 4249.966981] ACPI: PM: Preparing to enter system sleep state S3
[ 4250.484183] ACPI: PM: Saving platform NVS memory
[ 4250.484251] Disabling non-boot CPUs ...
[ 4250.485740] smpboot: CPU 1 is now offline
[ 4250.488120] smpboot: CPU 2 is now offline
[ 4250.489771] smpboot: CPU 3 is now offline
[ 4250.491405] smpboot: CPU 4 is now offline
[ 4250.493129] smpboot: CPU 5 is now offline
[ 4250.494788] smpboot: CPU 6 is now offline
[ 4250.496353] smpboot: CPU 7 is now offline
[ 4250.497959] smpboot: CPU 8 is now offline
[ 4250.499704] smpboot: CPU 9 is now offline
[ 4250.501226] smpboot: CPU 10 is now offline
[ 4250.502751] smpboot: CPU 11 is now offline
[ 4250.504297] smpboot: CPU 12 is now offline
[ 4250.505822] smpboot: CPU 13 is now offline
[ 4250.507370] smpboot: CPU 14 is now offline
[ 4250.508832] smpboot: CPU 15 is now offline
[ 4250.510417] smpboot: CPU 16 is now offline
[ 4250.510511] Spectre V2 : Update user space SMT mitigation: STIBP off
[ 4250.511869] smpboot: CPU 17 is now offline
[ 4250.513373] smpboot: CPU 18 is now offline
[ 4250.514849] smpboot: CPU 19 is now offline
[ 4250.516272] smpboot: CPU 20 is now offline
[ 4250.517786] smpboot: CPU 21 is now offline
[ 4250.519228] smpboot: CPU 22 is now offline
[ 4250.520722] smpboot: CPU 23 is now offline
[ 4250.522197] smpboot: CPU 24 is now offline
[ 4250.523609] smpboot: CPU 25 is now offline
[ 4250.525114] smpboot: CPU 26 is now offline
[ 4250.526490] smpboot: CPU 27 is now offline
[ 4250.527915] smpboot: CPU 28 is now offline
[ 4250.529306] smpboot: CPU 29 is now offline
[ 4250.530724] smpboot: CPU 30 is now offline
[ 4250.532144] smpboot: CPU 31 is now offline
[ 4250.532524] ACPI: PM: Low-level resume complete
[ 4250.532546] ACPI: PM: Restoring platform NVS memory
[ 4250.532735] LVT offset 0 assigned for vector 0x400
[ 4250.533277] Enabling non-boot CPUs ...
[ 4250.533303] x86: Booting SMP configuration:
[ 4250.533304] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 4250.533430] microcode: CPU1: patch_level=0x0a201016
[ 4250.535718] ACPI: \_PR_.C002: Found 2 idle states
[ 4250.535840] CPU1 is up
[ 4250.535852] smpboot: Booting Node 0 Processor 2 APIC 0x4
[ 4250.535959] microcode: CPU2: patch_level=0x0a201016
[ 4250.538192] ACPI: \_PR_.C004: Found 2 idle states
[ 4250.538296] CPU2 is up
[ 4250.538306] smpboot: Booting Node 0 Processor 3 APIC 0x6
[ 4250.538414] microcode: CPU3: patch_level=0x0a201016
[ 4250.540636] ACPI: \_PR_.C006: Found 2 idle states
[ 4250.540756] CPU3 is up
[ 4250.540766] smpboot: Booting Node 0 Processor 4 APIC 0x8
[ 4250.540874] microcode: CPU4: patch_level=0x0a201016
[ 4250.543101] ACPI: \_PR_.C008: Found 2 idle states
[ 4250.543219] CPU4 is up
[ 4250.543228] smpboot: Booting Node 0 Processor 5 APIC 0xa
[ 4250.543336] microcode: CPU5: patch_level=0x0a201016
[ 4250.545561] ACPI: \_PR_.C00A: Found 2 idle states
[ 4250.545691] CPU5 is up
[ 4250.545700] smpboot: Booting Node 0 Processor 6 APIC 0xc
[ 4250.545811] microcode: CPU6: patch_level=0x0a201016
[ 4250.548033] ACPI: \_PR_.C00C: Found 2 idle states
[ 4250.548156] CPU6 is up
[ 4250.548165] smpboot: Booting Node 0 Processor 7 APIC 0xe
[ 4250.548274] microcode: CPU7: patch_level=0x0a201016
[ 4250.550497] ACPI: \_PR_.C00E: Found 2 idle states
[ 4250.550628] CPU7 is up
[ 4250.550639] smpboot: Booting Node 0 Processor 8 APIC 0x10
[ 4250.550788] microcode: CPU8: patch_level=0x0a201016
[ 4250.553040] ACPI: \_PR_.C010: Found 2 idle states
[ 4250.553194] CPU8 is up
[ 4250.553204] smpboot: Booting Node 0 Processor 9 APIC 0x12
[ 4250.553353] microcode: CPU9: patch_level=0x0a201016
[ 4250.555599] ACPI: \_PR_.C012: Found 2 idle states
[ 4250.555747] CPU9 is up
[ 4250.555757] smpboot: Booting Node 0 Processor 10 APIC 0x14
[ 4250.555905] microcode: CPU10: patch_level=0x0a201016
[ 4250.558152] ACPI: \_PR_.C014: Found 2 idle states
[ 4250.558298] CPU10 is up
[ 4250.558309] smpboot: Booting Node 0 Processor 11 APIC 0x16
[ 4250.558459] microcode: CPU11: patch_level=0x0a201016
[ 4250.560718] ACPI: \_PR_.C016: Found 2 idle states
[ 4250.560868] CPU11 is up
[ 4250.560879] smpboot: Booting Node 0 Processor 12 APIC 0x18
[ 4250.561028] microcode: CPU12: patch_level=0x0a201016
[ 4250.563272] ACPI: \_PR_.C018: Found 2 idle states
[ 4250.563427] CPU12 is up
[ 4250.563437] smpboot: Booting Node 0 Processor 13 APIC 0x1a
[ 4250.563586] microcode: CPU13: patch_level=0x0a201016
[ 4250.565841] ACPI: \_PR_.C01A: Found 2 idle states
[ 4250.566001] CPU13 is up
[ 4250.566011] smpboot: Booting Node 0 Processor 14 APIC 0x1c
[ 4250.566159] microcode: CPU14: patch_level=0x0a201016
[ 4250.568400] ACPI: \_PR_.C01C: Found 2 idle states
[ 4250.568565] CPU14 is up
[ 4250.568576] smpboot: Booting Node 0 Processor 15 APIC 0x1e
[ 4250.568725] microcode: CPU15: patch_level=0x0a201016
[ 4250.570981] ACPI: \_PR_.C01E: Found 2 idle states
[ 4250.571142] CPU15 is up
[ 4250.571152] smpboot: Booting Node 0 Processor 16 APIC 0x1
[ 4250.571291] microcode: CPU16: patch_level=0x0a201016
[ 4250.573530] ACPI: \_PR_.C001: Found 2 idle states
[ 4250.573759] Spectre V2 : Update user space SMT mitigation: STIBP always-on
[ 4250.573765] CPU16 is up
[ 4250.573779] smpboot: Booting Node 0 Processor 17 APIC 0x3
[ 4250.573887] microcode: CPU17: patch_level=0x0a201016
[ 4250.576138] ACPI: \_PR_.C003: Found 2 idle states
[ 4250.576330] CPU17 is up
[ 4250.576340] smpboot: Booting Node 0 Processor 18 APIC 0x5
[ 4250.576447] microcode: CPU18: patch_level=0x0a201016
[ 4250.578710] ACPI: \_PR_.C005: Found 2 idle states
[ 4250.578908] CPU18 is up
[ 4250.578918] smpboot: Booting Node 0 Processor 19 APIC 0x7
[ 4250.579025] microcode: CPU19: patch_level=0x0a201016
[ 4250.581268] ACPI: \_PR_.C007: Found 2 idle states
[ 4250.581473] CPU19 is up
[ 4250.581483] smpboot: Booting Node 0 Processor 20 APIC 0x9
[ 4250.581590] microcode: CPU20: patch_level=0x0a201016
[ 4250.583844] ACPI: \_PR_.C009: Found 2 idle states
[ 4250.584058] CPU20 is up
[ 4250.584067] smpboot: Booting Node 0 Processor 21 APIC 0xb
[ 4250.584175] microcode: CPU21: patch_level=0x0a201016
[ 4250.586418] ACPI: \_PR_.C00B: Found 2 idle states
[ 4250.586638] CPU21 is up
[ 4250.586648] smpboot: Booting Node 0 Processor 22 APIC 0xd
[ 4250.586754] microcode: CPU22: patch_level=0x0a201016
[ 4250.589013] ACPI: \_PR_.C00D: Found 2 idle states
[ 4250.589236] CPU22 is up
[ 4250.589245] smpboot: Booting Node 0 Processor 23 APIC 0xf
[ 4250.589353] microcode: CPU23: patch_level=0x0a201016
[ 4250.591596] ACPI: \_PR_.C00F: Found 2 idle states
[ 4250.591834] CPU23 is up
[ 4250.591843] smpboot: Booting Node 0 Processor 24 APIC 0x11
[ 4250.591986] microcode: CPU24: patch_level=0x0a201016
[ 4250.594245] ACPI: \_PR_.C011: Found 2 idle states
[ 4250.594493] CPU24 is up
[ 4250.594504] smpboot: Booting Node 0 Processor 25 APIC 0x13
[ 4250.594647] microcode: CPU25: patch_level=0x0a201016
[ 4250.596904] ACPI: \_PR_.C013: Found 2 idle states
[ 4250.597157] CPU25 is up
[ 4250.597167] smpboot: Booting Node 0 Processor 26 APIC 0x15
[ 4250.597310] microcode: CPU26: patch_level=0x0a201016
[ 4250.599596] ACPI: \_PR_.C015: Found 2 idle states
[ 4250.599858] CPU26 is up
[ 4250.599870] smpboot: Booting Node 0 Processor 27 APIC 0x17
[ 4250.600013] microcode: CPU27: patch_level=0x0a201016
[ 4250.602268] ACPI: \_PR_.C017: Found 2 idle states
[ 4250.602535] CPU27 is up
[ 4250.602546] smpboot: Booting Node 0 Processor 28 APIC 0x19
[ 4250.602692] microcode: CPU28: patch_level=0x0a201016
[ 4250.604970] ACPI: \_PR_.C019: Found 2 idle states
[ 4250.605242] CPU28 is up
[ 4250.605253] smpboot: Booting Node 0 Processor 29 APIC 0x1b
[ 4250.605396] microcode: CPU29: patch_level=0x0a201016
[ 4250.607663] ACPI: \_PR_.C01B: Found 2 idle states
[ 4250.607942] CPU29 is up
[ 4250.607952] smpboot: Booting Node 0 Processor 30 APIC 0x1d
[ 4250.608095] microcode: CPU30: patch_level=0x0a201016
[ 4250.610351] ACPI: \_PR_.C01D: Found 2 idle states
[ 4250.610643] CPU30 is up
[ 4250.610653] smpboot: Booting Node 0 Processor 31 APIC 0x1f
[ 4250.610800] microcode: CPU31: patch_level=0x0a201016
[ 4250.613089] ACPI: \_PR_.C01F: Found 2 idle states
[ 4250.613397] CPU31 is up
[ 4250.615024] ACPI: PM: Waking up from system sleep state S3
[ 4250.621087] sd 1:0:0:0: [sda] Starting disk
[ 4250.621211] [drm] PCIE GART of 512M enabled (table at 0x00000080008CA000).
[ 4250.621243] [drm] PSP is resuming...
[ 4250.622832] nvme nvme0: Shutdown timeout set to 10 seconds
[ 4250.628133] nvme nvme0: 32/0/0 default/read/poll queues
[ 4250.696827] [drm] reserve 0xa00000 from 0x82fe000000 for PSP TMR
[ 4250.742897] [drm] PCIE GART of 256M enabled (table at 0x000000F4008CA000).
[ 4250.784703] amdgpu 0000:0b:00.0: amdgpu: RAS: optional ras ta ucode is not available
[ 4250.786748] r8169 0000:03:00.0 eth0: Link is Down
[ 4250.797815] amdgpu 0000:0b:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[ 4250.797816] amdgpu 0000:0b:00.0: amdgpu: SMU is resuming...
[ 4250.797820] amdgpu 0000:0b:00.0: amdgpu: smu driver if version = 0x0000000e, smu fw if version = 0x00000012, smu fw program = 0, version = 0x00413500 (65.53.0)
[ 4250.797822] amdgpu 0000:0b:00.0: amdgpu: SMU driver if version not matched
[ 4250.797869] amdgpu 0000:0b:00.0: amdgpu: use vbios provided pptable
[ 4250.845339] usb 5-4: reset high-speed USB device number 3 using xhci_hcd
[ 4250.858850] amdgpu 0000:0b:00.0: amdgpu: SMU is resumed successfully!
[ 4250.860030] [drm] DMUB hardware initialized: version=0x0202000F
[ 4250.909094] usb 3-4: reset full-speed USB device number 4 using xhci_hcd
[ 4250.929892] ata3: SATA link down (SStatus 0 SControl 300)
[ 4251.028762] br0: port 1(eth0) entered disabled state
[ 4251.155090] usb 3-5: reset full-speed USB device number 5 using xhci_hcd
[ 4251.193871] [drm] kiq ring mec 2 pipe 1 q 0
[ 4251.198962] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 4251.199243] [drm] JPEG decode initialized successfully.
[ 4251.199262] amdgpu 0000:0b:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 4251.199263] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 4251.199264] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 4251.199264] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[ 4251.199265] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[ 4251.199265] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[ 4251.199266] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[ 4251.199266] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[ 4251.199267] amdgpu 0000:0b:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[ 4251.199267] amdgpu 0000:0b:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[ 4251.199268] amdgpu 0000:0b:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 4251.199268] amdgpu 0000:0b:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[ 4251.199269] amdgpu 0000:0b:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[ 4251.199269] amdgpu 0000:0b:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[ 4251.199270] amdgpu 0000:0b:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[ 4251.199270] amdgpu 0000:0b:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[ 4251.436704] [drm] Fence fallback timer expired on ring sdma0
[ 4251.940701] [drm] Fence fallback timer expired on ring sdma0
[ 4251.982227] [drm] UVD and UVD ENC initialized successfully.
[ 4252.083249] [drm] VCE initialized successfully.
[ 4253.361328] r8169 0000:03:00.0 eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[ 4253.361334] br0: port 1(eth0) entered blocking state
[ 4253.361335] br0: port 1(eth0) entered forwarding state
[ 4255.961849] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[ 4256.025527] ata2.00: configured for UDMA/133
[ 4256.031075] OOM killer enabled.
[ 4256.031076] Restarting tasks ... done.
[ 4256.031357] random: crng reseeded on system resumption
[ 4256.035197] PM: suspend exit

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-09 22:34     ` Zev Weiss
  2022-08-09 23:50       ` Zoltán Kővágó
@ 2022-08-10  0:03       ` Guenter Roeck
  1 sibling, 0 replies; 14+ messages in thread
From: Guenter Roeck @ 2022-08-10  0:03 UTC (permalink / raw)
  To: Zev Weiss, Zoltán Kővágó; +Cc: linux-hwmon

On 8/9/22 15:34, Zev Weiss wrote:
> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>> On 2022-08-09 22:56, Zev Weiss wrote:
>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>> Hi,
>>>>
>>>> [1.] One line summary of the problem:
>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>
>>>> [2.] Full description of the problem/report:
>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that resuming after suspend no longer works: fans start up, then about a second later, the computer just shuts down, leaving the USB ports powered up (normally it turns them off on shutdown). The screens don't turn on during this timeframe, so I can't see any useful log messages.
>>>>
>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) Split core and platform driver" which looks like a refactor commit, but apparently it broke something.
>>>>
>>>
>>> Hi Zoltán,
>>>
>>> Thanks for the thorough bug report.  You're right that that commit was essentially just a refactor, though there was one slight change to the nct6775_suspend() function introduced during the review process that may perhaps have had some subtle unintended side-effects.
>>>
>>> Can you test the following patch and report if it resolves the problem?
>>>
>>>
>>> Thanks,
>>> Zev
>>
>> Hi Zev,
>>
>> Thanks for the quick reply. Yes, it looks like your patch does solve the problem (I've applied it on top of 5.19 (after fighting with my mail client for a while) and suspended a few times, it's working so far).
>>
>> Regards,
>> Zoltan
> 
> Great, thanks.
> 
> Guenter, it looks like nct6775_suspend() really does in fact need to use nct6775_update_device() instead of dev_get_drvdata(), though it's not immediately obvious to me why.  Though given that the bulk of of the body of nct6775_update_device() is inside an 'if' block that might not necessarily execute every time, I also wonder if it might be vulnerable to exhibiting the same problem depending on timing.
> 

It isn't obvious to me either except ... is it possible that nct6775_update_device()
was never called (ie that the 'sensors' command was never executed) ?
That might just possibly explain the problem because in that case the values
restored in the resume() function were actually never read from the chip.

Guenter

> Zoltán, if you could try another experiment to try to gather some data on that -- with the patch from my previous email still applied, could you try suspending via:
> 
>      $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && echo mem > /sys/power/state
> 
> and see if suspend is broken again?  The idea being to read from the hwmon device's sensors and then immediately suspend so that the nct6775_update_device() call from nct6775_suspend() falls within the driver's 1.5 second cache window and hence skips most of the work that function does.  If the same problem starts occurring again that way, it seems we've had essentially the same bug lurking for a while and the change in the patch you bisected to just made it happen consistently instead of unpredictably.  If suspend/resume continues working however, then I'm quite mystified, because the only other thing that's happening in that function is a mutex lock/unlock.
> 
> 
> Thanks,
> Zev
> 


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-09 23:50       ` Zoltán Kővágó
@ 2022-08-10  1:39         ` Zev Weiss
  2022-08-10  3:00           ` Armin Wolf
  2022-08-10  8:22           ` Zoltán Kővágó
  0 siblings, 2 replies; 14+ messages in thread
From: Zev Weiss @ 2022-08-10  1:39 UTC (permalink / raw)
  To: Zoltán Kővágó; +Cc: Guenter Roeck, linux-hwmon

On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>On 2022-08-10 00:34, Zev Weiss wrote:
>>On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>On 2022-08-09 22:56, Zev Weiss wrote:
>>>>On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>Hi,
>>>>>
>>>>>[1.] One line summary of the problem:
>>>>>NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>
>>>>>[2.] Full description of the problem/report:
>>>>>After updating my kernel from 5.18.11 to 5.19, I've noticed 
>>>>>that resuming after suspend no longer works: fans start up, 
>>>>>then about a second later, the computer just shuts down, 
>>>>>leaving the USB ports powered up (normally it turns them off 
>>>>>on shutdown). The screens don't turn on during this timeframe, 
>>>>>so I can't see any useful log messages.
>>>>>
>>>>>Bisecting between 5.18 (where it still worked) and 5.19 lead 
>>>>>me to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: 
>>>>>(nct6775) Split core and platform driver" which looks like a 
>>>>>refactor commit, but apparently it broke something.
>>>>>
>>>>
>>>>Hi Zoltán,
>>>>
>>>>Thanks for the thorough bug report.  You're right that that 
>>>>commit was essentially just a refactor, though there was one 
>>>>slight change to the nct6775_suspend() function introduced 
>>>>during the review process that may perhaps have had some subtle 
>>>>unintended side-effects.
>>>>
>>>>Can you test the following patch and report if it resolves the problem?
>>>>
>>>>
>>>>Thanks,
>>>>Zev
>>>
>>>Hi Zev,
>>>
>>>Thanks for the quick reply. Yes, it looks like your patch does 
>>>solve the problem (I've applied it on top of 5.19 (after fighting 
>>>with my mail client for a while) and suspended a few times, it's 
>>>working so far).
>>>
>>>Regards,
>>>Zoltan
>>
>>Great, thanks.
>>
>>Guenter, it looks like nct6775_suspend() really does in fact need to 
>>use nct6775_update_device() instead of dev_get_drvdata(), though 
>>it's not immediately obvious to me why.  Though given that the bulk 
>>of of the body of nct6775_update_device() is inside an 'if' block 
>>that might not necessarily execute every time, I also wonder if it 
>>might be vulnerable to exhibiting the same problem depending on 
>>timing.
>>
>>Zoltán, if you could try another experiment to try to gather some 
>>data on that -- with the patch from my previous email still applied, 
>>could you try suspending via:
>>
>>     $ cat 
>>/sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && 
>>echo mem > /sys/power/state
>
>Tried it, three times in fact, and it worked fine every time. Looking 
>at the dmesg, though, it looks like it needs a bit more than 1.5 sec 
>to suspend. Where's that 1.5 sec limit defined? I will try to increase 
>it tomorrow.
>

The 1.5 second duration comes from this line in nct6775_update_device():

   if (time_after(jiffies, data->last_updated + HZ + HZ / 2)

Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want 
to lengthen it you could do e.g. 10 * HZ or something instead.

Though as Guenter noted, one other possibility is that with the previous 
(buggy) version nct6775_update_device() might never have gotten called 
at all -- do you know if that might be the case on your system?  (i.e.  
do you have any userspace monitoring program or the like that would have 
been reading from the nct6775 device's sensors?)  If something like that 
never ran between the driver getting loaded and the system suspending, 
that might also (partially) explain things; to test that out you could 
revert to the old buggy code and see if the suspend problem still occurs 
if you explicitly run

   $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input

(or just 'sensors' if you've got the lm-sensors package installed).  That will
ensure that nct6775_update_device() gets called before the suspend 
operation, which could help narrow things down further.


Thanks,
Zev


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  1:39         ` Zev Weiss
@ 2022-08-10  3:00           ` Armin Wolf
  2022-08-10  3:36             ` Guenter Roeck
  2022-08-10  8:22           ` Zoltán Kővágó
  1 sibling, 1 reply; 14+ messages in thread
From: Armin Wolf @ 2022-08-10  3:00 UTC (permalink / raw)
  To: Zev Weiss, Zoltán Kővágó; +Cc: Guenter Roeck, linux-hwmon

Am 10.08.22 um 03:39 schrieb Zev Weiss:

> On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>> On 2022-08-10 00:34, Zev Weiss wrote:
>>> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>> On 2022-08-09 22:56, Zev Weiss wrote:
>>>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [1.] One line summary of the problem:
>>>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>
>>>>>> [2.] Full description of the problem/report:
>>>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that
>>>>>> resuming after suspend no longer works: fans start up, then about
>>>>>> a second later, the computer just shuts down, leaving the USB
>>>>>> ports powered up (normally it turns them off on shutdown). The
>>>>>> screens don't turn on during this timeframe, so I can't see any
>>>>>> useful log messages.
>>>>>>
>>>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me
>>>>>> to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon:
>>>>>> (nct6775) Split core and platform driver" which looks like a
>>>>>> refactor commit, but apparently it broke something.
>>>>>>
>>>>>
>>>>> Hi Zoltán,
>>>>>
>>>>> Thanks for the thorough bug report.  You're right that that commit
>>>>> was essentially just a refactor, though there was one slight
>>>>> change to the nct6775_suspend() function introduced during the
>>>>> review process that may perhaps have had some subtle unintended
>>>>> side-effects.
>>>>>
>>>>> Can you test the following patch and report if it resolves the
>>>>> problem?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Zev
>>>>
>>>> Hi Zev,
>>>>
>>>> Thanks for the quick reply. Yes, it looks like your patch does
>>>> solve the problem (I've applied it on top of 5.19 (after fighting
>>>> with my mail client for a while) and suspended a few times, it's
>>>> working so far).
>>>>
>>>> Regards,
>>>> Zoltan
>>>
>>> Great, thanks.
>>>
>>> Guenter, it looks like nct6775_suspend() really does in fact need to
>>> use nct6775_update_device() instead of dev_get_drvdata(), though
>>> it's not immediately obvious to me why.  Though given that the bulk
>>> of of the body of nct6775_update_device() is inside an 'if' block
>>> that might not necessarily execute every time, I also wonder if it
>>> might be vulnerable to exhibiting the same problem depending on timing.
>>>
>>> Zoltán, if you could try another experiment to try to gather some
>>> data on that -- with the patch from my previous email still applied,
>>> could you try suspending via:
>>>
>>>     $ cat
>>> /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input &&
>>> echo mem > /sys/power/state
>>
>> Tried it, three times in fact, and it worked fine every time. Looking
>> at the dmesg, though, it looks like it needs a bit more than 1.5 sec
>> to suspend. Where's that 1.5 sec limit defined? I will try to
>> increase it tomorrow.
>>
>
> The 1.5 second duration comes from this line in nct6775_update_device():
>
>   if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>
> Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want
> to lengthen it you could do e.g. 10 * HZ or something instead.
>
> Though as Guenter noted, one other possibility is that with the
> previous (buggy) version nct6775_update_device() might never have
> gotten called at all -- do you know if that might be the case on your
> system?  (i.e.  do you have any userspace monitoring program or the
> like that would have been reading from the nct6775 device's sensors?) 
> If something like that never ran between the driver getting loaded and
> the system suspending, that might also (partially) explain things; to
> test that out you could revert to the old buggy code and see if the
> suspend problem still occurs if you explicitly run
>
>   $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>
> (or just 'sensors' if you've got the lm-sensors package installed). 
> That will
> ensure that nct6775_update_device() gets called before the suspend
> operation, which could help narrow things down further.
>
In case you are running a Debian-based distribution, please note that Debian unconditionally executes sensors on startup, something
i discovered when the dell-smm-hwmon driver was causing issues on my machine.

So you may need to temporarily disable lm-sensors.service and/or /etc/init.d/lm-sensors for testing.

Armin Wolf

>
> Thanks,
> Zev
>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  3:00           ` Armin Wolf
@ 2022-08-10  3:36             ` Guenter Roeck
  2022-08-10  6:03               ` Zev Weiss
  0 siblings, 1 reply; 14+ messages in thread
From: Guenter Roeck @ 2022-08-10  3:36 UTC (permalink / raw)
  To: Armin Wolf, Zev Weiss, Zoltán Kővágó; +Cc: linux-hwmon

On 8/9/22 20:00, Armin Wolf wrote:
> Am 10.08.22 um 03:39 schrieb Zev Weiss:
> 
>> On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>>> On 2022-08-10 00:34, Zev Weiss wrote:
>>>> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>>> On 2022-08-09 22:56, Zev Weiss wrote:
>>>>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> [1.] One line summary of the problem:
>>>>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>>
>>>>>>> [2.] Full description of the problem/report:
>>>>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that
>>>>>>> resuming after suspend no longer works: fans start up, then about
>>>>>>> a second later, the computer just shuts down, leaving the USB
>>>>>>> ports powered up (normally it turns them off on shutdown). The
>>>>>>> screens don't turn on during this timeframe, so I can't see any
>>>>>>> useful log messages.
>>>>>>>
>>>>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me
>>>>>>> to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon:
>>>>>>> (nct6775) Split core and platform driver" which looks like a
>>>>>>> refactor commit, but apparently it broke something.
>>>>>>>
>>>>>>
>>>>>> Hi Zoltán,
>>>>>>
>>>>>> Thanks for the thorough bug report.  You're right that that commit
>>>>>> was essentially just a refactor, though there was one slight
>>>>>> change to the nct6775_suspend() function introduced during the
>>>>>> review process that may perhaps have had some subtle unintended
>>>>>> side-effects.
>>>>>>
>>>>>> Can you test the following patch and report if it resolves the
>>>>>> problem?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Zev
>>>>>
>>>>> Hi Zev,
>>>>>
>>>>> Thanks for the quick reply. Yes, it looks like your patch does
>>>>> solve the problem (I've applied it on top of 5.19 (after fighting
>>>>> with my mail client for a while) and suspended a few times, it's
>>>>> working so far).
>>>>>
>>>>> Regards,
>>>>> Zoltan
>>>>
>>>> Great, thanks.
>>>>
>>>> Guenter, it looks like nct6775_suspend() really does in fact need to
>>>> use nct6775_update_device() instead of dev_get_drvdata(), though
>>>> it's not immediately obvious to me why.  Though given that the bulk
>>>> of of the body of nct6775_update_device() is inside an 'if' block
>>>> that might not necessarily execute every time, I also wonder if it
>>>> might be vulnerable to exhibiting the same problem depending on timing.
>>>>
>>>> Zoltán, if you could try another experiment to try to gather some
>>>> data on that -- with the patch from my previous email still applied,
>>>> could you try suspending via:
>>>>
>>>>     $ cat
>>>> /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input &&
>>>> echo mem > /sys/power/state
>>>
>>> Tried it, three times in fact, and it worked fine every time. Looking
>>> at the dmesg, though, it looks like it needs a bit more than 1.5 sec
>>> to suspend. Where's that 1.5 sec limit defined? I will try to
>>> increase it tomorrow.
>>>
>>
>> The 1.5 second duration comes from this line in nct6775_update_device():
>>
>>   if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>>
>> Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want
>> to lengthen it you could do e.g. 10 * HZ or something instead.
>>
>> Though as Guenter noted, one other possibility is that with the
>> previous (buggy) version nct6775_update_device() might never have
>> gotten called at all -- do you know if that might be the case on your
>> system?  (i.e.  do you have any userspace monitoring program or the
>> like that would have been reading from the nct6775 device's sensors?)
>> If something like that never ran between the driver getting loaded and
>> the system suspending, that might also (partially) explain things; to
>> test that out you could revert to the old buggy code and see if the
>> suspend problem still occurs if you explicitly run
>>
>>   $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>>
>> (or just 'sensors' if you've got the lm-sensors package installed).
>> That will
>> ensure that nct6775_update_device() gets called before the suspend
>> operation, which could help narrow things down further.
>>
> In case you are running a Debian-based distribution, please note that Debian unconditionally executes sensors on startup, something
> i discovered when the dell-smm-hwmon driver was causing issues on my machine.
> 
> So you may need to temporarily disable lm-sensors.service and/or /etc/init.d/lm-sensors for testing.
> 

The assumption was that the sensors command was _not_ executed prior
to suspend and that this triggers the problem. If it was executed at least
once and the crash on resume still happens, something else must be going
on. Either case, I think it is best to apply the patch suggested by Zev.
We can try to figure out the exact cause of the problem later. Zev, can
you send me a formal patch ? I would like to apply and send it to Linus
as soon as possible.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  3:36             ` Guenter Roeck
@ 2022-08-10  6:03               ` Zev Weiss
  2022-08-10  7:33                 ` Guenter Roeck
  0 siblings, 1 reply; 14+ messages in thread
From: Zev Weiss @ 2022-08-10  6:03 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Armin Wolf, Zoltán Kővágó, linux-hwmon

On Tue, Aug 09, 2022 at 08:36:24PM PDT, Guenter Roeck wrote:
>On 8/9/22 20:00, Armin Wolf wrote:
>>Am 10.08.22 um 03:39 schrieb Zev Weiss:
>>
>>>On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>>>>On 2022-08-10 00:34, Zev Weiss wrote:
>>>>>On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>>>>On 2022-08-09 22:56, Zev Weiss wrote:
>>>>>>>On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>>>>Hi,
>>>>>>>>
>>>>>>>>[1.] One line summary of the problem:
>>>>>>>>NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>>>
>>>>>>>>[2.] Full description of the problem/report:
>>>>>>>>After updating my kernel from 5.18.11 to 5.19, I've noticed that
>>>>>>>>resuming after suspend no longer works: fans start up, then about
>>>>>>>>a second later, the computer just shuts down, leaving the USB
>>>>>>>>ports powered up (normally it turns them off on shutdown). The
>>>>>>>>screens don't turn on during this timeframe, so I can't see any
>>>>>>>>useful log messages.
>>>>>>>>
>>>>>>>>Bisecting between 5.18 (where it still worked) and 5.19 lead me
>>>>>>>>to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon:
>>>>>>>>(nct6775) Split core and platform driver" which looks like a
>>>>>>>>refactor commit, but apparently it broke something.
>>>>>>>>
>>>>>>>
>>>>>>>Hi Zoltán,
>>>>>>>
>>>>>>>Thanks for the thorough bug report.  You're right that that commit
>>>>>>>was essentially just a refactor, though there was one slight
>>>>>>>change to the nct6775_suspend() function introduced during the
>>>>>>>review process that may perhaps have had some subtle unintended
>>>>>>>side-effects.
>>>>>>>
>>>>>>>Can you test the following patch and report if it resolves the
>>>>>>>problem?
>>>>>>>
>>>>>>>
>>>>>>>Thanks,
>>>>>>>Zev
>>>>>>
>>>>>>Hi Zev,
>>>>>>
>>>>>>Thanks for the quick reply. Yes, it looks like your patch does
>>>>>>solve the problem (I've applied it on top of 5.19 (after fighting
>>>>>>with my mail client for a while) and suspended a few times, it's
>>>>>>working so far).
>>>>>>
>>>>>>Regards,
>>>>>>Zoltan
>>>>>
>>>>>Great, thanks.
>>>>>
>>>>>Guenter, it looks like nct6775_suspend() really does in fact need to
>>>>>use nct6775_update_device() instead of dev_get_drvdata(), though
>>>>>it's not immediately obvious to me why.  Though given that the bulk
>>>>>of of the body of nct6775_update_device() is inside an 'if' block
>>>>>that might not necessarily execute every time, I also wonder if it
>>>>>might be vulnerable to exhibiting the same problem depending on timing.
>>>>>
>>>>>Zoltán, if you could try another experiment to try to gather some
>>>>>data on that -- with the patch from my previous email still applied,
>>>>>could you try suspending via:
>>>>>
>>>>>    $ cat
>>>>>/sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input &&
>>>>>echo mem > /sys/power/state
>>>>
>>>>Tried it, three times in fact, and it worked fine every time. Looking
>>>>at the dmesg, though, it looks like it needs a bit more than 1.5 sec
>>>>to suspend. Where's that 1.5 sec limit defined? I will try to
>>>>increase it tomorrow.
>>>>
>>>
>>>The 1.5 second duration comes from this line in nct6775_update_device():
>>>
>>>  if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>>>
>>>Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want
>>>to lengthen it you could do e.g. 10 * HZ or something instead.
>>>
>>>Though as Guenter noted, one other possibility is that with the
>>>previous (buggy) version nct6775_update_device() might never have
>>>gotten called at all -- do you know if that might be the case on your
>>>system?  (i.e.  do you have any userspace monitoring program or the
>>>like that would have been reading from the nct6775 device's sensors?)
>>>If something like that never ran between the driver getting loaded and
>>>the system suspending, that might also (partially) explain things; to
>>>test that out you could revert to the old buggy code and see if the
>>>suspend problem still occurs if you explicitly run
>>>
>>>  $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>>>
>>>(or just 'sensors' if you've got the lm-sensors package installed).
>>>That will
>>>ensure that nct6775_update_device() gets called before the suspend
>>>operation, which could help narrow things down further.
>>>
>>In case you are running a Debian-based distribution, please note that Debian unconditionally executes sensors on startup, something
>>i discovered when the dell-smm-hwmon driver was causing issues on my machine.
>>
>>So you may need to temporarily disable lm-sensors.service and/or /etc/init.d/lm-sensors for testing.
>>

I'm not sure offhand what it does in that area, but Zoltán's initial bug 
report indicated that the problem occurred on Gentoo, FWIW.

>
>The assumption was that the sensors command was _not_ executed prior
>to suspend and that this triggers the problem. If it was executed at least
>once and the crash on resume still happens, something else must be going
>on. Either case, I think it is best to apply the patch suggested by Zev.
>We can try to figure out the exact cause of the problem later. Zev, can
>you send me a formal patch ? I would like to apply and send it to Linus
>as soon as possible.
>

Ack, patch sent.

 From what we've seen so far though, it seems likely (if for no other 
reason than that if this _isn't_ the cause there must be something truly 
weird going on) that the problem stems from nct6775_update_device() 
simply having never been called at all.  If we can confirm that that's 
the case, should we perhaps just add an unconditional call to 
nct6775_update_device() at the end of nct6775_probe(), and revert the 
patch I just sent so as to avoid the single-consumer symbol export?


Zev


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  6:03               ` Zev Weiss
@ 2022-08-10  7:33                 ` Guenter Roeck
  2022-08-10 22:56                   ` Zev Weiss
  0 siblings, 1 reply; 14+ messages in thread
From: Guenter Roeck @ 2022-08-10  7:33 UTC (permalink / raw)
  To: Zev Weiss; +Cc: Armin Wolf, Zoltán Kővágó, linux-hwmon

On 8/9/22 23:03, Zev Weiss wrote:
> On Tue, Aug 09, 2022 at 08:36:24PM PDT, Guenter Roeck wrote:
>> On 8/9/22 20:00, Armin Wolf wrote:
>>> Am 10.08.22 um 03:39 schrieb Zev Weiss:
>>>
>>>> On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>>>>> On 2022-08-10 00:34, Zev Weiss wrote:
>>>>>> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>>>>> On 2022-08-09 22:56, Zev Weiss wrote:
>>>>>>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> [1.] One line summary of the problem:
>>>>>>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>>>>
>>>>>>>>> [2.] Full description of the problem/report:
>>>>>>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that
>>>>>>>>> resuming after suspend no longer works: fans start up, then about
>>>>>>>>> a second later, the computer just shuts down, leaving the USB
>>>>>>>>> ports powered up (normally it turns them off on shutdown). The
>>>>>>>>> screens don't turn on during this timeframe, so I can't see any
>>>>>>>>> useful log messages.
>>>>>>>>>
>>>>>>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me
>>>>>>>>> to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon:
>>>>>>>>> (nct6775) Split core and platform driver" which looks like a
>>>>>>>>> refactor commit, but apparently it broke something.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Zoltán,
>>>>>>>>
>>>>>>>> Thanks for the thorough bug report.  You're right that that commit
>>>>>>>> was essentially just a refactor, though there was one slight
>>>>>>>> change to the nct6775_suspend() function introduced during the
>>>>>>>> review process that may perhaps have had some subtle unintended
>>>>>>>> side-effects.
>>>>>>>>
>>>>>>>> Can you test the following patch and report if it resolves the
>>>>>>>> problem?
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Zev
>>>>>>>
>>>>>>> Hi Zev,
>>>>>>>
>>>>>>> Thanks for the quick reply. Yes, it looks like your patch does
>>>>>>> solve the problem (I've applied it on top of 5.19 (after fighting
>>>>>>> with my mail client for a while) and suspended a few times, it's
>>>>>>> working so far).
>>>>>>>
>>>>>>> Regards,
>>>>>>> Zoltan
>>>>>>
>>>>>> Great, thanks.
>>>>>>
>>>>>> Guenter, it looks like nct6775_suspend() really does in fact need to
>>>>>> use nct6775_update_device() instead of dev_get_drvdata(), though
>>>>>> it's not immediately obvious to me why.  Though given that the bulk
>>>>>> of of the body of nct6775_update_device() is inside an 'if' block
>>>>>> that might not necessarily execute every time, I also wonder if it
>>>>>> might be vulnerable to exhibiting the same problem depending on timing.
>>>>>>
>>>>>> Zoltán, if you could try another experiment to try to gather some
>>>>>> data on that -- with the patch from my previous email still applied,
>>>>>> could you try suspending via:
>>>>>>
>>>>>>     $ cat
>>>>>> /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input &&
>>>>>> echo mem > /sys/power/state
>>>>>
>>>>> Tried it, three times in fact, and it worked fine every time. Looking
>>>>> at the dmesg, though, it looks like it needs a bit more than 1.5 sec
>>>>> to suspend. Where's that 1.5 sec limit defined? I will try to
>>>>> increase it tomorrow.
>>>>>
>>>>
>>>> The 1.5 second duration comes from this line in nct6775_update_device():
>>>>
>>>>   if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>>>>
>>>> Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want
>>>> to lengthen it you could do e.g. 10 * HZ or something instead.
>>>>
>>>> Though as Guenter noted, one other possibility is that with the
>>>> previous (buggy) version nct6775_update_device() might never have
>>>> gotten called at all -- do you know if that might be the case on your
>>>> system?  (i.e.  do you have any userspace monitoring program or the
>>>> like that would have been reading from the nct6775 device's sensors?)
>>>> If something like that never ran between the driver getting loaded and
>>>> the system suspending, that might also (partially) explain things; to
>>>> test that out you could revert to the old buggy code and see if the
>>>> suspend problem still occurs if you explicitly run
>>>>
>>>>   $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>>>>
>>>> (or just 'sensors' if you've got the lm-sensors package installed).
>>>> That will
>>>> ensure that nct6775_update_device() gets called before the suspend
>>>> operation, which could help narrow things down further.
>>>>
>>> In case you are running a Debian-based distribution, please note that Debian unconditionally executes sensors on startup, something
>>> i discovered when the dell-smm-hwmon driver was causing issues on my machine.
>>>
>>> So you may need to temporarily disable lm-sensors.service and/or /etc/init.d/lm-sensors for testing.
>>>
> 
> I'm not sure offhand what it does in that area, but Zoltán's initial bug report indicated that the problem occurred on Gentoo, FWIW.
> 
>>
>> The assumption was that the sensors command was _not_ executed prior
>> to suspend and that this triggers the problem. If it was executed at least
>> once and the crash on resume still happens, something else must be going
>> on. Either case, I think it is best to apply the patch suggested by Zev.
>> We can try to figure out the exact cause of the problem later. Zev, can
>> you send me a formal patch ? I would like to apply and send it to Linus
>> as soon as possible.
>>
> 
> Ack, patch sent.
> 
>  From what we've seen so far though, it seems likely (if for no other reason than that if this _isn't_ the cause there must be something truly weird going on) that the problem stems from nct6775_update_device() simply having never been called at all.  If we can confirm that that's the case, should we perhaps just add an unconditional call to nct6775_update_device() at the end of nct6775_probe(), and revert the patch I just sent so as to avoid the single-consumer symbol export?
> 

I thought about calling it in probe, but I don't really like it because
it would slow down boot. Let's introduce a namespace (say, NCT6775) instead.
Change all EXPORT_SYMBOL_GPL(x) to EXPORT_SYMBOL_NS_GPL(x, NCT6775) and
add MODULE_IMPORT_NS(NCT6775) to nct6775-platform.c and nct6775-i2c.c.
This will ensure that the exported symbols can only be called from the
nct6775 code.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  1:39         ` Zev Weiss
  2022-08-10  3:00           ` Armin Wolf
@ 2022-08-10  8:22           ` Zoltán Kővágó
  2022-08-10 22:59             ` Zev Weiss
  1 sibling, 1 reply; 14+ messages in thread
From: Zoltán Kővágó @ 2022-08-10  8:22 UTC (permalink / raw)
  To: Zev Weiss; +Cc: Guenter Roeck, linux-hwmon

On 2022-08-10 03:39, Zev Weiss wrote:
> On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>> On 2022-08-10 00:34, Zev Weiss wrote:
>>> On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>> On 2022-08-09 22:56, Zev Weiss wrote:
>>>>> On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>> Hi,
>>>>>>
>>>>>> [1.] One line summary of the problem:
>>>>>> NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>
>>>>>> [2.] Full description of the problem/report:
>>>>>> After updating my kernel from 5.18.11 to 5.19, I've noticed that 
>>>>>> resuming after suspend no longer works: fans start up, then about 
>>>>>> a second later, the computer just shuts down, leaving the USB 
>>>>>> ports powered up (normally it turns them off on shutdown). The 
>>>>>> screens don't turn on during this timeframe, so I can't see any 
>>>>>> useful log messages.
>>>>>>
>>>>>> Bisecting between 5.18 (where it still worked) and 5.19 lead me to 
>>>>>> commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon: (nct6775) 
>>>>>> Split core and platform driver" which looks like a refactor 
>>>>>> commit, but apparently it broke something.
>>>>>>
>>>>>
>>>>> Hi Zoltán,
>>>>>
>>>>> Thanks for the thorough bug report.  You're right that that commit 
>>>>> was essentially just a refactor, though there was one slight change 
>>>>> to the nct6775_suspend() function introduced during the review 
>>>>> process that may perhaps have had some subtle unintended side-effects.
>>>>>
>>>>> Can you test the following patch and report if it resolves the 
>>>>> problem?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Zev
>>>>
>>>> Hi Zev,
>>>>
>>>> Thanks for the quick reply. Yes, it looks like your patch does solve 
>>>> the problem (I've applied it on top of 5.19 (after fighting with my 
>>>> mail client for a while) and suspended a few times, it's working so 
>>>> far).
>>>>
>>>> Regards,
>>>> Zoltan
>>>
>>> Great, thanks.
>>>
>>> Guenter, it looks like nct6775_suspend() really does in fact need to 
>>> use nct6775_update_device() instead of dev_get_drvdata(), though it's 
>>> not immediately obvious to me why.  Though given that the bulk of of 
>>> the body of nct6775_update_device() is inside an 'if' block that 
>>> might not necessarily execute every time, I also wonder if it might 
>>> be vulnerable to exhibiting the same problem depending on timing.
>>>
>>> Zoltán, if you could try another experiment to try to gather some 
>>> data on that -- with the patch from my previous email still applied, 
>>> could you try suspending via:
>>>
>>>     $ cat 
>>> /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input && 
>>> echo mem > /sys/power/state
>>
>> Tried it, three times in fact, and it worked fine every time. Looking 
>> at the dmesg, though, it looks like it needs a bit more than 1.5 sec 
>> to suspend. Where's that 1.5 sec limit defined? I will try to increase 
>> it tomorrow.
>>
> 
> The 1.5 second duration comes from this line in nct6775_update_device():
> 
>    if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
> 
> Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want 
> to lengthen it you could do e.g. 10 * HZ or something instead.

Tried that, this isn't the problem.

> 
> Though as Guenter noted, one other possibility is that with the previous 
> (buggy) version nct6775_update_device() might never have gotten called 
> at all -- do you know if that might be the case on your system?  (i.e. 
> do you have any userspace monitoring program or the like that would have 
> been reading from the nct6775 device's sensors?)  If something like that 
> never ran between the driver getting loaded and the system suspending, 
> that might also (partially) explain things; to test that out you could 
> revert to the old buggy code and see if the suspend problem still occurs 
> if you explicitly run
> 
>    $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
> 
> (or just 'sensors' if you've got the lm-sensors package installed).  
> That will
> ensure that nct6775_update_device() gets called before the suspend 
> operation, which could help narrow things down further.

Yup, this was it. It looks like I remembered it wrong and my monitoring 
widget in the end only used k10temp and not nct6798, so I could very 
easily suspend without any reads from nct6775 before (and that widget 
itself even only ran when I had an X session).

Regards,
Zoltan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  7:33                 ` Guenter Roeck
@ 2022-08-10 22:56                   ` Zev Weiss
  2022-08-11  0:20                     ` Guenter Roeck
  0 siblings, 1 reply; 14+ messages in thread
From: Zev Weiss @ 2022-08-10 22:56 UTC (permalink / raw)
  To: Guenter Roeck; +Cc: Armin Wolf, Zoltán Kővágó, linux-hwmon

On Wed, Aug 10, 2022 at 12:33:43AM PDT, Guenter Roeck wrote:
>On 8/9/22 23:03, Zev Weiss wrote:
>>On Tue, Aug 09, 2022 at 08:36:24PM PDT, Guenter Roeck wrote:
>>>On 8/9/22 20:00, Armin Wolf wrote:
>>>>Am 10.08.22 um 03:39 schrieb Zev Weiss:
>>>>
>>>>>On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>>>>>>On 2022-08-10 00:34, Zev Weiss wrote:
>>>>>>>On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>>>>>>On 2022-08-09 22:56, Zev Weiss wrote:
>>>>>>>>>On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>>>>>>Hi,
>>>>>>>>>>
>>>>>>>>>>[1.] One line summary of the problem:
>>>>>>>>>>NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>>>>>
>>>>>>>>>>[2.] Full description of the problem/report:
>>>>>>>>>>After updating my kernel from 5.18.11 to 5.19, I've noticed that
>>>>>>>>>>resuming after suspend no longer works: fans start up, then about
>>>>>>>>>>a second later, the computer just shuts down, leaving the USB
>>>>>>>>>>ports powered up (normally it turns them off on shutdown). The
>>>>>>>>>>screens don't turn on during this timeframe, so I can't see any
>>>>>>>>>>useful log messages.
>>>>>>>>>>
>>>>>>>>>>Bisecting between 5.18 (where it still worked) and 5.19 lead me
>>>>>>>>>>to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd "hwmon:
>>>>>>>>>>(nct6775) Split core and platform driver" which looks like a
>>>>>>>>>>refactor commit, but apparently it broke something.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Hi Zoltán,
>>>>>>>>>
>>>>>>>>>Thanks for the thorough bug report.  You're right that that commit
>>>>>>>>>was essentially just a refactor, though there was one slight
>>>>>>>>>change to the nct6775_suspend() function introduced during the
>>>>>>>>>review process that may perhaps have had some subtle unintended
>>>>>>>>>side-effects.
>>>>>>>>>
>>>>>>>>>Can you test the following patch and report if it resolves the
>>>>>>>>>problem?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>Thanks,
>>>>>>>>>Zev
>>>>>>>>
>>>>>>>>Hi Zev,
>>>>>>>>
>>>>>>>>Thanks for the quick reply. Yes, it looks like your patch does
>>>>>>>>solve the problem (I've applied it on top of 5.19 (after fighting
>>>>>>>>with my mail client for a while) and suspended a few times, it's
>>>>>>>>working so far).
>>>>>>>>
>>>>>>>>Regards,
>>>>>>>>Zoltan
>>>>>>>
>>>>>>>Great, thanks.
>>>>>>>
>>>>>>>Guenter, it looks like nct6775_suspend() really does in fact need to
>>>>>>>use nct6775_update_device() instead of dev_get_drvdata(), though
>>>>>>>it's not immediately obvious to me why.  Though given that the bulk
>>>>>>>of of the body of nct6775_update_device() is inside an 'if' block
>>>>>>>that might not necessarily execute every time, I also wonder if it
>>>>>>>might be vulnerable to exhibiting the same problem depending on timing.
>>>>>>>
>>>>>>>Zoltán, if you could try another experiment to try to gather some
>>>>>>>data on that -- with the patch from my previous email still applied,
>>>>>>>could you try suspending via:
>>>>>>>
>>>>>>>    $ cat
>>>>>>>/sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input &&
>>>>>>>echo mem > /sys/power/state
>>>>>>
>>>>>>Tried it, three times in fact, and it worked fine every time. Looking
>>>>>>at the dmesg, though, it looks like it needs a bit more than 1.5 sec
>>>>>>to suspend. Where's that 1.5 sec limit defined? I will try to
>>>>>>increase it tomorrow.
>>>>>>
>>>>>
>>>>>The 1.5 second duration comes from this line in nct6775_update_device():
>>>>>
>>>>>  if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>>>>>
>>>>>Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you want
>>>>>to lengthen it you could do e.g. 10 * HZ or something instead.
>>>>>
>>>>>Though as Guenter noted, one other possibility is that with the
>>>>>previous (buggy) version nct6775_update_device() might never have
>>>>>gotten called at all -- do you know if that might be the case on your
>>>>>system?  (i.e.  do you have any userspace monitoring program or the
>>>>>like that would have been reading from the nct6775 device's sensors?)
>>>>>If something like that never ran between the driver getting loaded and
>>>>>the system suspending, that might also (partially) explain things; to
>>>>>test that out you could revert to the old buggy code and see if the
>>>>>suspend problem still occurs if you explicitly run
>>>>>
>>>>>  $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>>>>>
>>>>>(or just 'sensors' if you've got the lm-sensors package installed).
>>>>>That will
>>>>>ensure that nct6775_update_device() gets called before the suspend
>>>>>operation, which could help narrow things down further.
>>>>>
>>>>In case you are running a Debian-based distribution, please note that Debian unconditionally executes sensors on startup, something
>>>>i discovered when the dell-smm-hwmon driver was causing issues on my machine.
>>>>
>>>>So you may need to temporarily disable lm-sensors.service and/or /etc/init.d/lm-sensors for testing.
>>>>
>>
>>I'm not sure offhand what it does in that area, but Zoltán's initial bug report indicated that the problem occurred on Gentoo, FWIW.
>>
>>>
>>>The assumption was that the sensors command was _not_ executed prior
>>>to suspend and that this triggers the problem. If it was executed at least
>>>once and the crash on resume still happens, something else must be going
>>>on. Either case, I think it is best to apply the patch suggested by Zev.
>>>We can try to figure out the exact cause of the problem later. Zev, can
>>>you send me a formal patch ? I would like to apply and send it to Linus
>>>as soon as possible.
>>>
>>
>>Ack, patch sent.
>>
>> From what we've seen so far though, it seems likely (if for no other reason than that if this _isn't_ the cause there must be something truly weird going on) that the problem stems from nct6775_update_device() simply having never been called at all.  If we can confirm that that's the case, should we perhaps just add an unconditional call to nct6775_update_device() at the end of nct6775_probe(), and revert the patch I just sent so as to avoid the single-consumer symbol export?
>>
>
>I thought about calling it in probe, but I don't really like it because
>it would slow down boot.

True, that's a fair point.

>Let's introduce a namespace (say, NCT6775) instead.
>Change all EXPORT_SYMBOL_GPL(x) to EXPORT_SYMBOL_NS_GPL(x, NCT6775) and
>add MODULE_IMPORT_NS(NCT6775) to nct6775-platform.c and nct6775-i2c.c.
>This will ensure that the exported symbols can only be called from the
>nct6775 code.
>

nct6775-core.c currently has DEFAULT_SYMBOL_NAMESPACE #defined to 
HWMON_NCT6775 (and the platform and i2c drivers are using 
MODULE_IMPORT_NS(HWMON_NCT6775)), so I think that's already in place -- 
I guess we can just leave it as is with the patch that's now upstream 
then.


Zev


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10  8:22           ` Zoltán Kővágó
@ 2022-08-10 22:59             ` Zev Weiss
  0 siblings, 0 replies; 14+ messages in thread
From: Zev Weiss @ 2022-08-10 22:59 UTC (permalink / raw)
  To: Zoltán Kővágó; +Cc: Guenter Roeck, linux-hwmon

On Wed, Aug 10, 2022 at 01:22:04AM PDT, Zoltán Kővágó wrote:
>On 2022-08-10 03:39, Zev Weiss wrote:
>>On Tue, Aug 09, 2022 at 04:50:20PM PDT, Zoltán Kővágó wrote:
>>>On 2022-08-10 00:34, Zev Weiss wrote:
>>>>On Tue, Aug 09, 2022 at 02:28:20PM PDT, Zoltán Kővágó wrote:
>>>>>On 2022-08-09 22:56, Zev Weiss wrote:
>>>>>>On Tue, Aug 09, 2022 at 01:27:48PM PDT, Zoltán Kővágó wrote:
>>>>>>>Hi,
>>>>>>>
>>>>>>>[1.] One line summary of the problem:
>>>>>>>NCT6775: suspend doesn't work after updating to Linux 5.19
>>>>>>>
>>>>>>>[2.] Full description of the problem/report:
>>>>>>>After updating my kernel from 5.18.11 to 5.19, I've 
>>>>>>>noticed that resuming after suspend no longer works: fans 
>>>>>>>start up, then about a second later, the computer just 
>>>>>>>shuts down, leaving the USB ports powered up (normally it 
>>>>>>>turns them off on shutdown). The screens don't turn on 
>>>>>>>during this timeframe, so I can't see any useful log 
>>>>>>>messages.
>>>>>>>
>>>>>>>Bisecting between 5.18 (where it still worked) and 5.19 
>>>>>>>lead me to commit c3963bc0a0cf9ecb205a9d4976eb92b6df2fa3fd 
>>>>>>>"hwmon: (nct6775) Split core and platform driver" which 
>>>>>>>looks like a refactor commit, but apparently it broke 
>>>>>>>something.
>>>>>>>
>>>>>>
>>>>>>Hi Zoltán,
>>>>>>
>>>>>>Thanks for the thorough bug report.  You're right that that 
>>>>>>commit was essentially just a refactor, though there was one 
>>>>>>slight change to the nct6775_suspend() function introduced 
>>>>>>during the review process that may perhaps have had some 
>>>>>>subtle unintended side-effects.
>>>>>>
>>>>>>Can you test the following patch and report if it resolves 
>>>>>>the problem?
>>>>>>
>>>>>>
>>>>>>Thanks,
>>>>>>Zev
>>>>>
>>>>>Hi Zev,
>>>>>
>>>>>Thanks for the quick reply. Yes, it looks like your patch does 
>>>>>solve the problem (I've applied it on top of 5.19 (after 
>>>>>fighting with my mail client for a while) and suspended a few 
>>>>>times, it's working so far).
>>>>>
>>>>>Regards,
>>>>>Zoltan
>>>>
>>>>Great, thanks.
>>>>
>>>>Guenter, it looks like nct6775_suspend() really does in fact 
>>>>need to use nct6775_update_device() instead of 
>>>>dev_get_drvdata(), though it's not immediately obvious to me 
>>>>why.  Though given that the bulk of of the body of 
>>>>nct6775_update_device() is inside an 'if' block that might not 
>>>>necessarily execute every time, I also wonder if it might be 
>>>>vulnerable to exhibiting the same problem depending on timing.
>>>>
>>>>Zoltán, if you could try another experiment to try to gather 
>>>>some data on that -- with the patch from my previous email still 
>>>>applied, could you try suspending via:
>>>>
>>>>    $ cat 
>>>>/sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input 
>>>>&& echo mem > /sys/power/state
>>>
>>>Tried it, three times in fact, and it worked fine every time. 
>>>Looking at the dmesg, though, it looks like it needs a bit more 
>>>than 1.5 sec to suspend. Where's that 1.5 sec limit defined? I 
>>>will try to increase it tomorrow.
>>>
>>
>>The 1.5 second duration comes from this line in nct6775_update_device():
>>
>>   if (time_after(jiffies, data->last_updated + HZ + HZ / 2)
>>
>>Each 'HZ' represents one second, so HZ + HZ / 2 = 1.5 sec; if you 
>>want to lengthen it you could do e.g. 10 * HZ or something instead.
>
>Tried that, this isn't the problem.
>
>>
>>Though as Guenter noted, one other possibility is that with the 
>>previous (buggy) version nct6775_update_device() might never have 
>>gotten called at all -- do you know if that might be the case on 
>>your system?  (i.e. do you have any userspace monitoring program or 
>>the like that would have been reading from the nct6775 device's 
>>sensors?)  If something like that never ran between the driver 
>>getting loaded and the system suspending, that might also 
>>(partially) explain things; to test that out you could revert to the 
>>old buggy code and see if the suspend problem still occurs if you 
>>explicitly run
>>
>>   $ cat /sys/bus/platform/drivers/nct6775/nct6775.*/hwmon/hwmon*/*_input
>>
>>(or just 'sensors' if you've got the lm-sensors package installed).  
>>That will
>>ensure that nct6775_update_device() gets called before the suspend 
>>operation, which could help narrow things down further.
>
>Yup, this was it. It looks like I remembered it wrong and my 
>monitoring widget in the end only used k10temp and not nct6798, so I 
>could very easily suspend without any reads from nct6775 before (and 
>that widget itself even only ran when I had an X session).
>
>Regards,
>Zoltan

Great, glad that's confirmed.  The fix is now in Linus's tree and should 
presumably get picked up in the -stable tree for 5.19.1.

Thanks for the bug report & testing!


Zev


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19
  2022-08-10 22:56                   ` Zev Weiss
@ 2022-08-11  0:20                     ` Guenter Roeck
  0 siblings, 0 replies; 14+ messages in thread
From: Guenter Roeck @ 2022-08-11  0:20 UTC (permalink / raw)
  To: Zev Weiss; +Cc: Armin Wolf, Zoltán Kővágó, linux-hwmon

On Wed, Aug 10, 2022 at 03:56:08PM -0700, Zev Weiss wrote:
> > 
> > I thought about calling it in probe, but I don't really like it because
> > it would slow down boot.
> 
> True, that's a fair point.
> 
> > Let's introduce a namespace (say, NCT6775) instead.
> > Change all EXPORT_SYMBOL_GPL(x) to EXPORT_SYMBOL_NS_GPL(x, NCT6775) and
> > add MODULE_IMPORT_NS(NCT6775) to nct6775-platform.c and nct6775-i2c.c.
> > This will ensure that the exported symbols can only be called from the
> > nct6775 code.
> > 
> 
> nct6775-core.c currently has DEFAULT_SYMBOL_NAMESPACE #defined to
> HWMON_NCT6775 (and the platform and i2c drivers are using
> MODULE_IMPORT_NS(HWMON_NCT6775)), so I think that's already in place -- I
> guess we can just leave it as is with the patch that's now upstream then.
> 
Excellent, looks like we are all set.

Thanks,
Guenter

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2022-08-11  0:20 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <f5990ef1-4efe-d2b1-8e50-c6890526c054@gmail.com>
2022-08-09 20:56 ` PROBLEM: NCT6775: suspend doesn't work after updating to Linux 5.19 Zev Weiss
2022-08-09 21:28   ` Zoltán Kővágó
2022-08-09 22:34     ` Zev Weiss
2022-08-09 23:50       ` Zoltán Kővágó
2022-08-10  1:39         ` Zev Weiss
2022-08-10  3:00           ` Armin Wolf
2022-08-10  3:36             ` Guenter Roeck
2022-08-10  6:03               ` Zev Weiss
2022-08-10  7:33                 ` Guenter Roeck
2022-08-10 22:56                   ` Zev Weiss
2022-08-11  0:20                     ` Guenter Roeck
2022-08-10  8:22           ` Zoltán Kővágó
2022-08-10 22:59             ` Zev Weiss
2022-08-10  0:03       ` Guenter Roeck

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.