linux-scsi.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
@ 2022-06-27  4:09 ` bugzilla-daemon
  2022-06-28 21:17 ` bugzilla-daemon
                   ` (31 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-06-27  4:09 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

Zhang Rui (rui.zhang@intel.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |rui.zhang@intel.com
          Component|Hibernation/Suspend         |Other
           Hardware|All                         |AMD
           Assignee|rjw@rjwysocki.net           |scsi_drivers-other@kernel-b
                   |                            |ugs.osdl.org
            Product|Power Management            |SCSI Drivers

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
  2022-06-27  4:09 ` [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16 bugzilla-daemon
@ 2022-06-28 21:17 ` bugzilla-daemon
  2022-06-28 22:30 ` bugzilla-daemon
                   ` (30 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-06-28 21:17 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

jason600 (jason600.groome@gmail.com) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |jason600.groome@gmail.com

--- Comment #26 from jason600 (jason600.groome@gmail.com) ---
I have been experiencing the same problem since 5.16. I have a hd and ssd (boot
from the ssd). Since 5.16 resume from suspend to ram is taking 6-8 seconds
(with 5.15 and before, it was less then 2 seconds. 

I narrowed it down to the hd waiting for spinup to finish suspend. Hdparm had
not effect, also unmounting the drive had no effect. The only was to fix it was
to remove the hd (mounted on sda) with 

# echo 1 > /sys/block/sda/device/delete

Then it resumed in 2 seconds again, but I couldn’t access the hd again until
reboot.

I installed the patches from https://github.com/bvanassche/linux/tree/sd-resume
on Opensuse Tumbleweed kernel 5.18.6.

I can confirm the patches work, resume from suspend is now back to under 2
seconds, with no obvious problems.

Thank you very much.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
  2022-06-27  4:09 ` [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16 bugzilla-daemon
  2022-06-28 21:17 ` bugzilla-daemon
@ 2022-06-28 22:30 ` bugzilla-daemon
  2022-07-15 18:44 ` bugzilla-daemon
                   ` (29 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-06-28 22:30 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #27 from Bart Van Assche (bvanassche@acm.org) ---
Thanks for testing! The patches from the sd-resume branch have been posted on
the linux-scsi mailing list. See also
https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/T/#t

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (2 preceding siblings ...)
  2022-06-28 22:30 ` bugzilla-daemon
@ 2022-07-15 18:44 ` bugzilla-daemon
  2022-08-15 11:02 ` bugzilla-daemon
                   ` (28 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-07-15 18:44 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

Nils (nilsdev@cocosmail.de) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |nilsdev@cocosmail.de

--- Comment #28 from Nils (nilsdev@cocosmail.de) ---
Possibly related:

https://bugzilla.kernel.org/show_bug.cgi?id=216087

"Only Toshiba MG08/MG09 SATA hard disks disappear after resume from S3 suspend
(reproducible) - AMD Ryzen 5 PRO 4650G"

on an X570 board (0x7901).

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (3 preceding siblings ...)
  2022-07-15 18:44 ` bugzilla-daemon
@ 2022-08-15 11:02 ` bugzilla-daemon
  2022-08-15 13:36 ` bugzilla-daemon
                   ` (27 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-15 11:02 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #29 from Nils (nilsdev@cocosmail.de) ---
(In reply to Bart Van Assche from comment #27)
> Thanks for testing! The patches from the sd-resume branch have been posted
> on the linux-scsi mailing list. See also
> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
> T/#t

When will the changes be merged into mainline please?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (4 preceding siblings ...)
  2022-08-15 11:02 ` bugzilla-daemon
@ 2022-08-15 13:36 ` bugzilla-daemon
  2022-08-16 11:08 ` bugzilla-daemon
                   ` (26 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-15 13:36 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #30 from Bart Van Assche (bvanassche@acm.org) ---
I think that commit 88f1669019bd ("scsi: sd: Rework asynchronous resume
support") got merged in Linus' tree (kernel v6.0-rc1) about two days ago.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (5 preceding siblings ...)
  2022-08-15 13:36 ` bugzilla-daemon
@ 2022-08-16 11:08 ` bugzilla-daemon
  2022-08-16 15:44 ` bugzilla-daemon
                   ` (25 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-16 11:08 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

gzhqyz@gmail.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |gzhqyz@gmail.com

--- Comment #31 from gzhqyz@gmail.com ---
Hi, on my laptop (500GB SATA HDD, Disk model: WDC WD5000LPVX-2), commit
88f1669019bd cause system hang after resume from suspend and reverting the
commit fixes my problem.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (6 preceding siblings ...)
  2022-08-16 11:08 ` bugzilla-daemon
@ 2022-08-16 15:44 ` bugzilla-daemon
  2022-08-16 16:10 ` bugzilla-daemon
                   ` (24 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-16 15:44 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #32 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to gzhqyz from comment #31)
> Hi, on my laptop (500GB SATA HDD, Disk model: WDC WD5000LPVX-2), commit
> 88f1669019bd cause system hang after resume from suspend and reverting the
> commit fixes my problem.

Can you try kernel 6.0-rc1 ?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (7 preceding siblings ...)
  2022-08-16 15:44 ` bugzilla-daemon
@ 2022-08-16 16:10 ` bugzilla-daemon
  2022-08-16 16:14 ` bugzilla-daemon
                   ` (23 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-16 16:10 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #33 from Nils (nilsdev@cocosmail.de) ---
(In reply to Bart Van Assche from comment #30)
> I think that commit 88f1669019bd ("scsi: sd: Rework asynchronous resume
> support") got merged in Linus' tree (kernel v6.0-rc1) about two days ago.

Great, thanks.  Looking forward to testing it once my distributor ships it.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (8 preceding siblings ...)
  2022-08-16 16:10 ` bugzilla-daemon
@ 2022-08-16 16:14 ` bugzilla-daemon
  2022-08-16 17:11 ` bugzilla-daemon
                   ` (22 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-16 16:14 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #34 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Nils from comment #33)
> (In reply to Bart Van Assche from comment #30)
> > I think that commit 88f1669019bd ("scsi: sd: Rework asynchronous resume
> > support") got merged in Linus' tree (kernel v6.0-rc1) about two days ago.
> 
> Great, thanks.  Looking forward to testing it once my distributor ships it.

The patch has a fixes tag, so it will get backported to the latest 5.19 stable
too.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (9 preceding siblings ...)
  2022-08-16 16:14 ` bugzilla-daemon
@ 2022-08-16 17:11 ` bugzilla-daemon
  2022-08-25 20:01 ` bugzilla-daemon
                   ` (21 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-16 17:11 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #35 from gzhqyz@gmail.com ---
(In reply to Damien Le Moal from comment #32)
> (In reply to gzhqyz from comment #31)
> > Hi, on my laptop (500GB SATA HDD, Disk model: WDC WD5000LPVX-2), commit
> > 88f1669019bd cause system hang after resume from suspend and reverting the
> > commit fixes my problem.
> 
> Can you try kernel 6.0-rc1 ?

Hi, the issue I reported is found in 6.0-rc1. Here is the kernel I am using
(without commit 88f1669019bd):

Linux archlinux 6.0.0-rc1-2-mainline #1 SMP Tue, 16 Aug 2022 03:18:24 +0000
x86_64 GNU/Linux

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (10 preceding siblings ...)
  2022-08-16 17:11 ` bugzilla-daemon
@ 2022-08-25 20:01 ` bugzilla-daemon
  2022-08-25 20:31 ` bugzilla-daemon
                   ` (20 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-25 20:01 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #36 from jason600 (jason600.groome@gmail.com) ---
(In reply to Bart Van Assche from comment #27)
> Thanks for testing! The patches from the sd-resume branch have been posted
> on the linux-scsi mailing list. See also
> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
> T/#t

Hi Bart, just an update for you. I noticed this had been removed from the
6.0-rc1 for freezing after suspend.

I've been compiling my kernel with this fix on various 5.18 kernels (with
opensuse tumbleweed), it has worked fine, no freezing on resume as others have
mentioned.

Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
froze after the first suspend. Rebooted and the same thing happened again. I
recompiled the kernel with the fix, just to make sure i didn't mess it up, and
the same happened again.

When you originally did this fix, you based it on 5.18, and indeed, it works
fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c file
for 5.19, presumably it was those changes that made this fix start freezing
after suspend.

Perhaps you could check if the other people that experienced freezing were
using either 5.19 or 6.0-rc1.

Hopefully, this information might help in finding the cause and re-fixing this
issue.

Thanks for all you help

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (11 preceding siblings ...)
  2022-08-25 20:01 ` bugzilla-daemon
@ 2022-08-25 20:31 ` bugzilla-daemon
  2022-08-25 22:15 ` bugzilla-daemon
                   ` (19 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-25 20:31 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #37 from Bart Van Assche (bvanassche@acm.org) ---
On 8/25/22 13:01, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #36 from jason600 (jason600.groome@gmail.com) ---
> (In reply to Bart Van Assche from comment #27)
>> Thanks for testing! The patches from the sd-resume branch have been posted
>> on the linux-scsi mailing list. See also
>>
>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
>> T/#t
> 
> Hi Bart, just an update for you. I noticed this had been removed from the
> 6.0-rc1 for freezing after suspend.
> 
> I've been compiling my kernel with this fix on various 5.18 kernels (with
> opensuse tumbleweed), it has worked fine, no freezing on resume as others
> have
> mentioned.
> 
> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
> froze after the first suspend. Rebooted and the same thing happened again. I
> recompiled the kernel with the fix, just to make sure i didn't mess it up,
> and
> the same happened again.
> 
> When you originally did this fix, you based it on 5.18, and indeed, it works
> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c
> file
> for 5.19, presumably it was those changes that made this fix start freezing
> after suspend.
> 
> Perhaps you could check if the other people that experienced freezing were
> using either 5.19 or 6.0-rc1.

Multiple people reported issues with freezes during suspend with kernel 
v6.0-rc1. Please take a look at the following report: 
https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@redhat.com/. 
It shows that if zoned ATA disks are present that blk_mq_freeze_queue() 
may be called from inside ata_scsi_dev_rescan() on the context of a work 
queue. ATA rescanning happens from inside the SCSI error handler. So 
there is potential for a lockup because of the following:
* Execution of the START command being postponed because the SCSI error 
handler is active.
* blk_mq_freeze_queue() waiting for the START command to finish.
* The START completion handler not being executed because it got queued 
on the same work queue as the ATA rescan work.

Unfortunately I do not know enough about the ATA core to proceed. I need 
help from an ATA expert.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (12 preceding siblings ...)
  2022-08-25 20:31 ` bugzilla-daemon
@ 2022-08-25 22:15 ` bugzilla-daemon
  2022-08-26  7:00   ` Damien Le Moal
  2022-08-26  7:00 ` bugzilla-daemon
                   ` (18 subsequent siblings)
  32 siblings, 1 reply; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-25 22:15 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #38 from damien.lemoal@opensource.wdc.com ---
On 8/26/22 05:31, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #37 from Bart Van Assche (bvanassche@acm.org) ---
> On 8/25/22 13:01, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>
>> --- Comment #36 from jason600 (jason600.groome@gmail.com) ---
>> (In reply to Bart Van Assche from comment #27)
>>> Thanks for testing! The patches from the sd-resume branch have been posted
>>> on the linux-scsi mailing list. See also
>>>
>>>
>>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
>>> T/#t
>>
>> Hi Bart, just an update for you. I noticed this had been removed from the
>> 6.0-rc1 for freezing after suspend.
>>
>> I've been compiling my kernel with this fix on various 5.18 kernels (with
>> opensuse tumbleweed), it has worked fine, no freezing on resume as others
>> have
>> mentioned.
>>
>> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
>> froze after the first suspend. Rebooted and the same thing happened again. I
>> recompiled the kernel with the fix, just to make sure i didn't mess it up,
>> and
>> the same happened again.
>>
>> When you originally did this fix, you based it on 5.18, and indeed, it works
>> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c
>> file
>> for 5.19, presumably it was those changes that made this fix start freezing
>> after suspend.
>>
>> Perhaps you could check if the other people that experienced freezing were
>> using either 5.19 or 6.0-rc1.
> 
> Multiple people reported issues with freezes during suspend with kernel 
> v6.0-rc1. Please take a look at the following report: 
> https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@redhat.com/. 
> It shows that if zoned ATA disks are present that blk_mq_freeze_queue() 
> may be called from inside ata_scsi_dev_rescan() on the context of a work 
> queue. ATA rescanning happens from inside the SCSI error handler. So 
> there is potential for a lockup because of the following:
> * Execution of the START command being postponed because the SCSI error 
> handler is active.
> * blk_mq_freeze_queue() waiting for the START command to finish.
> * The START completion handler not being executed because it got queued 
> on the same work queue as the ATA rescan work.
> 
> Unfortunately I do not know enough about the ATA core to proceed. I need 
> help from an ATA expert.

I will have a look today.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
  2022-08-25 22:15 ` bugzilla-daemon
@ 2022-08-26  7:00   ` Damien Le Moal
  0 siblings, 0 replies; 34+ messages in thread
From: Damien Le Moal @ 2022-08-26  7:00 UTC (permalink / raw)
  To: bugzilla-daemon, linux-scsi

On 8/26/22 07:15, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #38 from damien.lemoal@opensource.wdc.com ---
> On 8/26/22 05:31, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>
>> --- Comment #37 from Bart Van Assche (bvanassche@acm.org) ---
>> On 8/25/22 13:01, bugzilla-daemon@kernel.org wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>>
>>> --- Comment #36 from jason600 (jason600.groome@gmail.com) ---
>>> (In reply to Bart Van Assche from comment #27)
>>>> Thanks for testing! The patches from the sd-resume branch have been posted
>>>> on the linux-scsi mailing list. See also
>>>>
>>>>
>>>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
>>>> T/#t
>>>
>>> Hi Bart, just an update for you. I noticed this had been removed from the
>>> 6.0-rc1 for freezing after suspend.
>>>
>>> I've been compiling my kernel with this fix on various 5.18 kernels (with
>>> opensuse tumbleweed), it has worked fine, no freezing on resume as others
>>> have
>>> mentioned.
>>>
>>> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
>>> froze after the first suspend. Rebooted and the same thing happened again. I
>>> recompiled the kernel with the fix, just to make sure i didn't mess it up,
>>> and
>>> the same happened again.
>>>
>>> When you originally did this fix, you based it on 5.18, and indeed, it works
>>> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c
>>> file
>>> for 5.19, presumably it was those changes that made this fix start freezing
>>> after suspend.
>>>
>>> Perhaps you could check if the other people that experienced freezing were
>>> using either 5.19 or 6.0-rc1.
>>
>> Multiple people reported issues with freezes during suspend with kernel 
>> v6.0-rc1. Please take a look at the following report: 
>> https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@redhat.com/. 
>> It shows that if zoned ATA disks are present that blk_mq_freeze_queue() 
>> may be called from inside ata_scsi_dev_rescan() on the context of a work 
>> queue. ATA rescanning happens from inside the SCSI error handler. So 
>> there is potential for a lockup because of the following:
>> * Execution of the START command being postponed because the SCSI error 
>> handler is active.
>> * blk_mq_freeze_queue() waiting for the START command to finish.
>> * The START completion handler not being executed because it got queued 
>> on the same work queue as the ATA rescan work.

Checking the code, the dev pm resume call chain for ATA look like this.

ata_port_resume() -> ata_port_request_pm() -> ata_port_schedule_eh() ->
scsi_error_handler() thread runs -> shost->transportt->eh_strategy_handler
== ata_scsi_error() -> ata_scsi_port_error_handler() ->
ata_eh_handle_port_resume() -> ap->ops->port_resume() ==
ahci_port_resume() for AHCI adapters.

There are no commands issued to the device by this chain, only
registers/port settings being changed. So this should always complete
quickly and in itself not be the reason for the START_STOP command issued
by the sd driver to get stuck.

After calling ata_eh_handle_port_resume(), ata_scsi_port_error_handler()
will trigger a reset through ata_std_error_handler() -> ata_do_eh() and
that will cause the device rescan in EH context. Device rescan will
definitely spin up the device so the START_STOP command that sd_resume()
issues seems rather useless...

As a hack, it would be good to try something like this:

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 29e2f55c6faa..1bc92c04f048 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1081,6 +1081,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev,
struct ata_device *dev)
 	} else {
 		sdev->sector_size = ata_id_logical_sector_size(dev->id);
 		sdev->manage_start_stop = 1;
+		sdev->no_start_on_resume = 1;
 	}

 	/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8f79fa6318fe..4c28ca4d038b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3761,7 +3761,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
-	int ret;
+	int ret = 0;

 	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
 		return 0;
@@ -3770,7 +3770,8 @@ static int sd_resume(struct device *dev)
 		return 0;

 	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-	ret = sd_start_stop_device(sdkp, 1);
+	if (!sdkp->device->no_start_on_resume)
+		ret = sd_start_stop_device(sdkp, 1);
 	if (!ret)
 		opal_unlock_from_suspend(sdkp->opal_dev);
 	return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 3113471ca375..92e141536c6c 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -192,6 +192,7 @@ struct scsi_device {
 	unsigned no_start_on_add:1;	/* do not issue start on add */
 	unsigned allow_restart:1; /* issue START_UNIT in error handler */
 	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
+	unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on resume */
 	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
 	unsigned no_uld_attach:1; /* disable connecting to upper level drivers */
 	unsigned select_no_atn:1;

I am not sure at all this is correct though. This actually may break other
suspend/resume flavors. If we could somehow synchronize scsi pm resume to
run *after* ata pm resume, all problems should go away I think.


-- 
Damien Le Moal
Western Digital Research

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (13 preceding siblings ...)
  2022-08-25 22:15 ` bugzilla-daemon
@ 2022-08-26  7:00 ` bugzilla-daemon
  2022-10-06 15:37 ` bugzilla-daemon
                   ` (17 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-08-26  7:00 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #39 from damien.lemoal@opensource.wdc.com ---
On 8/26/22 07:15, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #38 from damien.lemoal@opensource.wdc.com ---
> On 8/26/22 05:31, bugzilla-daemon@kernel.org wrote:
>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>
>> --- Comment #37 from Bart Van Assche (bvanassche@acm.org) ---
>> On 8/25/22 13:01, bugzilla-daemon@kernel.org wrote:
>>> https://bugzilla.kernel.org/show_bug.cgi?id=215880
>>>
>>> --- Comment #36 from jason600 (jason600.groome@gmail.com) ---
>>> (In reply to Bart Van Assche from comment #27)
>>>> Thanks for testing! The patches from the sd-resume branch have been posted
>>>> on the linux-scsi mailing list. See also
>>>>
>>>>
>>>>
>>>> https://lore.kernel.org/linux-scsi/20220628222131.14780-1-bvanassche@acm.org/
>>>> T/#t
>>>
>>> Hi Bart, just an update for you. I noticed this had been removed from the
>>> 6.0-rc1 for freezing after suspend.
>>>
>>> I've been compiling my kernel with this fix on various 5.18 kernels (with
>>> opensuse tumbleweed), it has worked fine, no freezing on resume as others
>>> have
>>> mentioned.
>>>
>>> Yesterday, I updated to 5.19.2 kernel, applied the fix, recompiled, and it
>>> froze after the first suspend. Rebooted and the same thing happened again.
>>> I
>>> recompiled the kernel with the fix, just to make sure i didn't mess it up,
>>> and
>>> the same happened again.
>>>
>>> When you originally did this fix, you based it on 5.18, and indeed, it
>>> works
>>> fine on 5.18 for me. There were a lot of changes to the drivers/scsi/sd.c
>>> file
>>> for 5.19, presumably it was those changes that made this fix start freezing
>>> after suspend.
>>>
>>> Perhaps you could check if the other people that experienced freezing were
>>> using either 5.19 or 6.0-rc1.
>>
>> Multiple people reported issues with freezes during suspend with kernel 
>> v6.0-rc1. Please take a look at the following report: 
>>
>> https://lore.kernel.org/all/dd6844e7-f338-a4e9-2dad-0960e25b2ca1@redhat.com/. 
>> It shows that if zoned ATA disks are present that blk_mq_freeze_queue() 
>> may be called from inside ata_scsi_dev_rescan() on the context of a work 
>> queue. ATA rescanning happens from inside the SCSI error handler. So 
>> there is potential for a lockup because of the following:
>> * Execution of the START command being postponed because the SCSI error 
>> handler is active.
>> * blk_mq_freeze_queue() waiting for the START command to finish.
>> * The START completion handler not being executed because it got queued 
>> on the same work queue as the ATA rescan work.

Checking the code, the dev pm resume call chain for ATA look like this.

ata_port_resume() -> ata_port_request_pm() -> ata_port_schedule_eh() ->
scsi_error_handler() thread runs -> shost->transportt->eh_strategy_handler
== ata_scsi_error() -> ata_scsi_port_error_handler() ->
ata_eh_handle_port_resume() -> ap->ops->port_resume() ==
ahci_port_resume() for AHCI adapters.

There are no commands issued to the device by this chain, only
registers/port settings being changed. So this should always complete
quickly and in itself not be the reason for the START_STOP command issued
by the sd driver to get stuck.

After calling ata_eh_handle_port_resume(), ata_scsi_port_error_handler()
will trigger a reset through ata_std_error_handler() -> ata_do_eh() and
that will cause the device rescan in EH context. Device rescan will
definitely spin up the device so the START_STOP command that sd_resume()
issues seems rather useless...

As a hack, it would be good to try something like this:

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 29e2f55c6faa..1bc92c04f048 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1081,6 +1081,7 @@ int ata_scsi_dev_config(struct scsi_device *sdev,
struct ata_device *dev)
        } else {
                sdev->sector_size = ata_id_logical_sector_size(dev->id);
                sdev->manage_start_stop = 1;
+               sdev->no_start_on_resume = 1;
        }

        /*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 8f79fa6318fe..4c28ca4d038b 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3761,7 +3761,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
        struct scsi_disk *sdkp = dev_get_drvdata(dev);
-       int ret;
+       int ret = 0;

        if (!sdkp)      /* E.g.: runtime resume at the start of sd_probe() */
                return 0;
@@ -3770,7 +3770,8 @@ static int sd_resume(struct device *dev)
                return 0;

        sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-       ret = sd_start_stop_device(sdkp, 1);
+       if (!sdkp->device->no_start_on_resume)
+               ret = sd_start_stop_device(sdkp, 1);
        if (!ret)
                opal_unlock_from_suspend(sdkp->opal_dev);
        return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 3113471ca375..92e141536c6c 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -192,6 +192,7 @@ struct scsi_device {
        unsigned no_start_on_add:1;     /* do not issue start on add */
        unsigned allow_restart:1; /* issue START_UNIT in error handler */
        unsigned manage_start_stop:1;   /* Let HLD (sd) manage start/stop */
+       unsigned no_start_on_resume:1; /* Do not issue START_STOP_UNIT on
resume */
        unsigned start_stop_pwr_cond:1; /* Set power cond. in START_STOP_UNIT
*/
        unsigned no_uld_attach:1; /* disable connecting to upper level drivers
*/
        unsigned select_no_atn:1;

I am not sure at all this is correct though. This actually may break other
suspend/resume flavors. If we could somehow synchronize scsi pm resume to
run *after* ata pm resume, all problems should go away I think.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (14 preceding siblings ...)
  2022-08-26  7:00 ` bugzilla-daemon
@ 2022-10-06 15:37 ` bugzilla-daemon
  2022-10-06 23:48 ` bugzilla-daemon
                   ` (16 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-10-06 15:37 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #40 from jason600 (jason600.groome@gmail.com) ---
I have been using Damien Lemoal's patch for the last couple of days on kernel
5.19.8, it works fine for me, and no obvious issues.

I also tried in on kernel 6.0, again works fine and no obvious issues.

(In reply to damien.lemoal from comment #39)
> I am not sure at all this is correct though. This actually may break other
> suspend/resume flavors. If we could somehow synchronize scsi pm resume to
> run *after* ata pm resume, all problems should go away I think.

With regard to Damiens comments, and the problems with this patch last time it
was submitted, I think it would be a good idea to have more people testing this
patch before re-submitting it.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (15 preceding siblings ...)
  2022-10-06 15:37 ` bugzilla-daemon
@ 2022-10-06 23:48 ` bugzilla-daemon
  2022-10-07  0:10 ` bugzilla-daemon
                   ` (15 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-10-06 23:48 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #41 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to jason600 from comment #40)
> I have been using Damien Lemoal's patch for the last couple of days on
> kernel 5.19.8, it works fine for me, and no obvious issues.
> 
> I also tried in on kernel 6.0, again works fine and no obvious issues.
> 
> (In reply to damien.lemoal from comment #39)
> > I am not sure at all this is correct though. This actually may break other
> > suspend/resume flavors. If we could somehow synchronize scsi pm resume to
> > run *after* ata pm resume, all problems should go away I think.
> 
> With regard to Damiens comments, and the problems with this patch last time
> it was submitted, I think it would be a good idea to have more people
> testing this patch before re-submitting it.

I did not submit that patch on the lists, because I am not convinced this is
the right approach. It is a "hack" right now. As mentioned, the problem is that
scsi and ata pm resume are running at the same time because the dev/pm code
does not seem to know the dependency between the devices (struct dev). Would
need to dig further into that first to see if a cleaner solution can be coded.
Will try, but busy, so it may take some time. I first need to be able to
recreate the issue on my systems. Will start with that :)

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (16 preceding siblings ...)
  2022-10-06 23:48 ` bugzilla-daemon
@ 2022-10-07  0:10 ` bugzilla-daemon
  2022-10-07  0:15 ` bugzilla-daemon
                   ` (14 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-10-07  0:10 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #42 from Bart Van Assche (bvanassche@acm.org) ---
Has it been considered to use device_link_add() to enforce the order in which
devices are resumed?

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (17 preceding siblings ...)
  2022-10-07  0:10 ` bugzilla-daemon
@ 2022-10-07  0:15 ` bugzilla-daemon
  2023-07-08 23:17 ` bugzilla-daemon
                   ` (13 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2022-10-07  0:15 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #43 from damien.lemoal@opensource.wdc.com ---
On 10/7/22 09:10, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=215880
> 
> --- Comment #42 from Bart Van Assche (bvanassche@acm.org) ---
> Has it been considered to use device_link_add() to enforce the order in which
> devices are resumed?
> 

I am not very familiar with that API level and pm code in general. So I
cannot say. Will need to look into it.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (18 preceding siblings ...)
  2022-10-07  0:15 ` bugzilla-daemon
@ 2023-07-08 23:17 ` bugzilla-daemon
  2023-07-09  7:09 ` bugzilla-daemon
                   ` (12 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-08 23:17 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

Paul Ausbeck (paula@alumni.cse.ucsc.edu) changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |paula@alumni.cse.ucsc.edu

--- Comment #44 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
I'd like to add that I've also noticed this regression after upgrading from
Debian 11/5.10 to Debian 12/6.1

I've copied the resume portion of the dmesg log below. Note that the pci bus
activity and related gpu messages don't happen until after my 10 year old 2TB
hdd spins up. Restarting of tasks is also deferred until then. I also note that
the mouse cursor is also frozen until all those platters spin up..

To my mind, it would be a major improvement/tour de force if linux could defer
spinning up the hdd at all until it is needed. Think of all the drive wear and
energy savings that would ensue. I use my 2TB hdd for backing up smaller drives
and for storing movies. It would be quite nice if the drive were to not spin up
at all until accessed.


[167498.654484] ACPI: PM: Waking up from system sleep state S3
[167498.655607] pcieport 0000:00:1c.0: Enabling MPC IRBNCE
[167498.655608] pcieport 0000:00:1c.7: Enabling MPC IRBNCE
[167498.655613] pcieport 0000:00:1c.0: Intel PCH root port ACS workaround
enabled
[167498.655614] pcieport 0000:00:1c.7: Intel PCH root port ACS workaround
enabled
[167498.657222] parport_pc 00:05: activated
[167498.657638] serial 00:07: activated
[167498.666692] sd 2:0:0:0: [sdb] Starting disk
[167498.666709] sd 3:0:0:0: [sdd] Starting disk
[167498.666720] sd 1:0:0:0: [sdc] Starting disk
[167498.666725] sd 0:0:0:0: [sda] Starting disk
[167498.751660] input input0: event field not found
[167499.041464] ata6: SATA link down (SStatus 0 SControl 300)
[167499.041525] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[167499.041556] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[167499.041581] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[167499.042450] ata2.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.042455] ata2.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.042479] ata1.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.042484] ata1.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.042605] ata1.00: supports DRM functions and may not be fully accessible
[167499.043225] ata2.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.043233] ata2.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.043464] ata2.00: configured for UDMA/133
[167499.053394] ata5.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.053402] ata5.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.074315] ata5.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.074323] ata5.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.079649] ata5.00: configured for UDMA/100
[167499.130641] ata1.00: NCQ Send/Recv Log not supported
[167499.132638] ata1.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167499.132646] ata1.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167499.132837] ata1.00: supports DRM functions and may not be fully accessible
[167499.132990] ata1.00: NCQ Send/Recv Log not supported
[167499.134317] ata1.00: configured for UDMA/133
[167499.247035] firewire_core 0000:03:01.0: rediscovered device fw0
[167499.247089] br0: port 1(eth0) entered disabled state
[167501.553034] e1000e 0000:00:19.0 eth0: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: Rx/Tx
[167501.553075] br0: port 1(eth0) entered blocking state
[167501.553078] br0: port 1(eth0) entered forwarding state
[167504.079109] ata4: link is slow to respond, please be patient (ready=0)
[167504.079117] ata3: link is slow to respond, please be patient (ready=0)
[167504.799136] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[167504.917319] ata4.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167504.917324] ata4.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167504.918915] ata4.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167504.918922] ata4.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167504.919646] ata4.00: configured for UDMA/133
[167508.699207] ata3: COMRESET failed (errno=-16)
[167510.871278] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[167511.040204] ata3.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167511.040212] ata3.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167511.078770] ata3.00: ACPI cmd f5/00:00:00:00:00:00(SECURITY FREEZE LOCK)
filtered out
[167511.078778] ata3.00: ACPI cmd b1/c1:00:00:00:00:00(DEVICE CONFIGURATION
OVERLAY) filtered out
[167511.082920] ata3.00: configured for UDMA/133
[167511.107913] pci_bus 0000:03: Allocating resources
[167511.107932] pci 0000:02:00.0: bridge window [io  0x1000-0x0fff] to [bus 03]
add_size 1000
[167511.107939] pci 0000:02:00.0: bridge window [mem 0x00100000-0x000fffff
64bit pref] to [bus 03] add_size 200000 add_align 100000
[167511.107949] pci 0000:02:00.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[167511.107952] pci 0000:02:00.0: BAR 15: failed to assign [mem size 0x00200000
64bit pref]
[167511.107955] pci 0000:02:00.0: BAR 13: no space for [io  size 0x1000]
[167511.107958] pci 0000:02:00.0: BAR 13: failed to assign [io  size 0x1000]
[167511.107962] pci 0000:02:00.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[167511.107965] pci 0000:02:00.0: BAR 15: failed to assign [mem size 0x00200000
64bit pref]
[167511.107967] pci 0000:02:00.0: BAR 13: no space for [io  size 0x1000]
[167511.107970] pci 0000:02:00.0: BAR 13: failed to assign [io  size 0x1000]
[167511.108011] OOM killer enabled.
[167511.108014] Restarting tasks ... done.
[167511.111619] random: crng reseeded on system resumption
[167511.111760] PM: suspend exit

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (19 preceding siblings ...)
  2023-07-08 23:17 ` bugzilla-daemon
@ 2023-07-09  7:09 ` bugzilla-daemon
  2023-07-09 23:18 ` bugzilla-daemon
                   ` (11 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-09  7:09 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #45 from The Linux kernel's regression tracker (Thorsten Leemhuis) (regressions@leemhuis.info) ---
(In reply to Paul Ausbeck from comment #44)
> I'd like to add that I've also noticed this regression after upgrading from
> Debian 11/5.10 to Debian 12/6.1

Can't remember for sure (Bart and Damien might), but I think this issue should
have been fixed by this revert that went into 6.0:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=785538bfdd6

So if you see similar symptoms it's likely a different issue. You thus might
want to open a new report, as things otherwise get messy; it's also in your own
interest, see:
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/
especially
https://linux-regtracking.leemhuis.info/post/frequent-reasons-why-linux-kernel-bug-reports-are-ignored/#you-reported-your-issue-in-a-reply-to-an-earlier-report

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (20 preceding siblings ...)
  2023-07-09  7:09 ` bugzilla-daemon
@ 2023-07-09 23:18 ` bugzilla-daemon
  2023-07-10  0:18 ` bugzilla-daemon
                   ` (10 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-09 23:18 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #46 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Paul Ausbeck from comment #44)
> I'd like to add that I've also noticed this regression after upgrading from
> Debian 11/5.10 to Debian 12/6.1

> I've copied the resume portion of the dmesg log below. Note that the pci bus
> activity and related gpu messages don't happen until after my 10 year old
> 2TB hdd spins up. Restarting of tasks is also deferred until then. I also
> note that the mouse cursor is also frozen until all those platters spin up..

The GPU related messages are unrelated to ATA subsystem. Obviously, libata does
not attempt to use the GPU for anything :) Restarting of tasks being differed
until the hdd spins up is also normal. hdd spinning up and then starting
responding to commands is part of the normal resume process and tasks will be
resumed only after the kernel has finished resuming all devices. Resuming of
devices is asynchronous and devices that are unrelated can be resumed
simultaneously. So I am not sure why it seems that your GPU and mouse resuming
is delaying only after the HDD spin up. Note that this change to asynchronous
resume did cause some issues with ata/scsi (see below).

> To my mind, it would be a major improvement/tour de force if linux could
> defer spinning up the hdd at all until it is needed. Think of all the drive
> wear and energy savings that would ensue. I use my 2TB hdd for backing up
> smaller drives and for storing movies. It would be quite nice if the drive
> were to not spin up at all until accessed.

Having the drive sleep when unused can be achieved using power state timers
setting on the drive itself. Resume is different: the kernel must ensure that
the device that was there before suspend is still here, and in the case of
storage, that the device is still the same. This is necessary to avoid issues
with file system (that will resume later) and also with tasks that were doing
IOs when the system was suspended. If file systems or tasks were resumed
without the storage device ready and checked first, all sorts of bad things can
happen.

Note that commit 6aa0365a3c85 ("ata: libata-scsi: Avoid deadlock on rescan
after device resume") was added to kernel 6.4 (at rc7) to fix another issue
with resume: a hard hang (not a delay). This patch synchronizes scsi and ata
resume by having scsi resume wait for ata resume to complete, which means
waiting for the device re-scan to complete. And that in turn means that the HDD
must complete spin up first.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (21 preceding siblings ...)
  2023-07-09 23:18 ` bugzilla-daemon
@ 2023-07-10  0:18 ` bugzilla-daemon
  2023-07-10  0:47 ` bugzilla-daemon
                   ` (9 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  0:18 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #47 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
The PCI/GPU messages are only connected to the ATA messages in the sense that
the PCI/GPU activity is unexpectedly deferred until after ATA init is complete.
Since resuming of devices is supposedly asynchronous, one would have thought
that PCI/GPU init activity would be completed long before hdd spin up is
complete.

I realize that deferring drive spin up until needed would not be easy, that's
why I called such an idea a tour de force. In the past it may have been common
for an ata device to not resume reliably, but today even the 10 year old ata
devices on my Ivy Bridge machine resume just as reliably as they normally
operate, which is quite reliably. If it weren't for power outages, I'd have
years of continuous uptime. It's just a thought, but it may be time to revisit
how disks, especially spinning disks, are resumed. It seems to me that the
chance of a hdd failure during resume is not any greater than at any other
time. It should be theoretically possible to queue up resume commands and
execute them only when needed to service actual demand. The latency would have
to eventually be absorbed, but if the api's are designed and implemented
properly that would just happen when needed.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (22 preceding siblings ...)
  2023-07-10  0:18 ` bugzilla-daemon
@ 2023-07-10  0:47 ` bugzilla-daemon
  2023-07-10  1:02 ` bugzilla-daemon
                   ` (8 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  0:47 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #48 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Paul Ausbeck from comment #47)
> The PCI/GPU messages are only connected to the ATA messages in the sense
> that the PCI/GPU activity is unexpectedly deferred until after ATA init is
> complete. Since resuming of devices is supposedly asynchronous, one would
> have thought that PCI/GPU init activity would be completed long before hdd
> spin up is complete.

Maybe try enabling PM debug messages ? That could tell us what is going on.

> I realize that deferring drive spin up until needed would not be easy,
> that's why I called such an idea a tour de force. In the past it may have
> been common for an ata device to not resume reliably, but today even the 10
> year old ata devices on my Ivy Bridge machine resume just as reliably as
> they normally operate, which is quite reliably. If it weren't for power
> outages, I'd have years of continuous uptime. It's just a thought, but it
> may be time to revisit how disks, especially spinning disks, are resumed. It
> seems to me that the chance of a hdd failure during resume is not any
> greater than at any other time. It should be theoretically possible to queue
> up resume commands and execute them only when needed to service actual
> demand. The latency would have to eventually be absorbed, but if the api's
> are designed and implemented properly that would just happen when needed.

Spinning up the disk on resume is needed as many drives require the disk to be
spun up to reply to commands, and that even applies to commands that do not
access the media (e.g. IDENTIFY, READ LOG, etc). The reason is that many drives
save their meta-data on a reserved area of the media (the spinning disk).

So even if the kernel were to explicitly defer spinning up the disk, the disk
itself would in many instances automatically spinup, and that would manifest
itself with a long delay to the reply to the initial IDENTIFY command issued
during the rescan initiated by resume. And delaying that rescan is also not
desired as that would force the PM code to report "OK, this device is resumed
and ready" while it is in fact not ready at all to receive commands. So the
"tour de force" is really not desired at all in practice.

You likely can avoid the hdd spinup on resume by soft-removing the device
before suspend. Then we could add a libata option to ignore some ports on
resume so that they are not probed. But then, reusing the drive after resume
will require a manual rescan for the disk to show up again. And all of this
also assumes that there is no file system mounted on the disk, of course.

Given that most laptop these days do not use HDDs anymore, I do not think all
of this is worth the effort.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (23 preceding siblings ...)
  2023-07-10  0:47 ` bugzilla-daemon
@ 2023-07-10  1:02 ` bugzilla-daemon
  2023-07-10  1:07 ` bugzilla-daemon
                   ` (7 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  1:02 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #49 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
My reflex memory says that when resuming Debian 11/5.10 the mouse cursor was
live immediately after the LCD monitor came on. Current testing says that when
resuming Debian 12/6.1 the mouse cursor is frozen until at least five seconds
after the monitor comes on.

In order to verify that my reflex memory was correct, I booted Debian 12 but
with the 5.10 kernel. Sure enough, upon resume the mouse cursor is immediately
alive once the LCD monitor comes on.

I've attached the end portion of the Debian 12/5.10 resume dmesg log below.
Note that the restarting of tasks and the PCI/GPU activity that happened at the
end of resume for the Debian 12/6.1 combination happens before ATA bring up
with the Debian 12/5.10 combination.

This is starting to look like a regression to me.

[   45.516511] OOM killer enabled.
[   45.516512] Restarting tasks ... 
[   45.516726] pci_bus 0000:03: Allocating resources
[   45.516743] pci 0000:02:00.0: bridge window [io  0x1000-0x0fff] to [bus 03]
add_size 1000
[   45.516748] pci 0000:02:00.0: bridge window [mem 0x00100000-0x000fffff 64bit
pref] to [bus 03] add_size 200000 add_align 100000
[   45.516756] pci 0000:02:00.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[   45.516759] pci 0000:02:00.0: BAR 15: failed to assign [mem size 0x00200000
64bit pref]
[   45.516762] pci 0000:02:00.0: BAR 13: no space for [io  size 0x1000]
[   45.516764] pci 0000:02:00.0: BAR 13: failed to assign [io  size 0x1000]
[   45.516768] pci 0000:02:00.0: BAR 15: no space for [mem size 0x00200000
64bit pref]
[   45.516769] pci 0000:02:00.0: BAR 15: failed to assign [mem size 0x00200000
64bit pref]
[   45.516771] pci 0000:02:00.0: BAR 13: no space for [io  size 0x1000]
[   45.516772] pci 0000:02:00.0: BAR 13: failed to assign [io  size 0x1000]
[   45.518520] done.
[   45.518693] video LNXVIDEO:00: Restoring backlight state
[   45.518696] PM: suspend exit
[   45.641461] br0: port 1(eth0) entered disabled state
[   45.734342] ata6: SATA link down (SStatus 0 SControl 300)
[   45.734372] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   45.734405] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[   45.734435] ata5: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
[   45.735377] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.735382] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.735384] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.735433] ata2.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.735437] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.735440] ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.735499] ata1.00: supports DRM functions and may not be fully accessible
[   45.736244] ata2.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.736249] ata2.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.736251] ata2.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.736531] ata2.00: configured for UDMA/133
[   45.746293] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.746298] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.746301] ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.767136] ata5.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.767141] ata5.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.767144] ata5.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.772405] ata5.00: configured for UDMA/100
[   45.819126] ata1.00: NCQ Send/Recv Log not supported
[   45.820557] ata1.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   45.820571] ata1.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   45.820573] ata1.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   45.820687] ata1.00: supports DRM functions and may not be fully accessible
[   45.820786] ata1.00: NCQ Send/Recv Log not supported
[   45.821701] ata1.00: configured for UDMA/133
[   45.947955] firewire_core 0000:03:01.0: rediscovered device fw0
[   48.373868] e1000e 0000:00:19.0 eth0: NIC Link is Up 1000 Mbps Full Duplex,
Flow Control: Rx/Tx
[   48.373927] br0: port 1(eth0) entered blocking state

[   48.373930] br0: port 1(eth0) entered forwarding state
[   50.775921] ata3: link is slow to respond, please be patient (ready=0)
[   50.783922] ata4: link is slow to respond, please be patient (ready=0)
[   51.143938] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   51.262481] ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   51.262485] ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   51.262488] ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   51.264139] ata4.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   51.264144] ata4.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   51.264147] ata4.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   51.264802] ata4.00: configured for UDMA/133
[   55.403919] ata3: COMRESET failed (errno=-16)
[   56.799938] ata3: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[   56.981696] ata3.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   56.981702] ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   56.981705] ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   57.008271] ata3.00: ACPI cmd ef/10:06:00:00:00:00 (SET FEATURES) succeeded
[   57.008277] ata3.00: ACPI cmd f5/00:00:00:00:00:00 (SECURITY FREEZE LOCK)
filtered out
[   57.008280] ata3.00: ACPI cmd b1/c1:00:00:00:00:00 (DEVICE CONFIGURATION
OVERLAY) filtered out
[   57.012179] ata3.00: configured for UDMA/133

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (24 preceding siblings ...)
  2023-07-10  1:02 ` bugzilla-daemon
@ 2023-07-10  1:07 ` bugzilla-daemon
  2023-07-10  2:51 ` bugzilla-daemon
                   ` (6 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  1:07 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #50 from Damien Le Moal (damien.lemoal@wdc.com) ---
Hmmm... PCI device scan is supposed to be asynchronous. So having to wait for
the hdd to spinup should not impact the GPU/mouse (which I assume is USB ?).
The messages "pci 0000:02:00.0: BAR 15: failed to assign" are weird... What is
the device on port 0000:02:00.0 ? (see lspci).

Might want to raise this issue on linux-pci list.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (25 preceding siblings ...)
  2023-07-10  1:07 ` bugzilla-daemon
@ 2023-07-10  2:51 ` bugzilla-daemon
  2023-07-10  3:48 ` bugzilla-daemon
                   ` (5 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  2:51 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #51 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
The PCIe BAR (Base address register) is usually GPU related as lspci verifies.
BAR mapping failures are pretty routine on my older machines. My assessment is
that they aren't relevant except to show that at least some GPU activity is
deferred until after ATA bring up.

I think more important is that Restarting tasks ...  and even PM: suspend exit
are happening before ATA bring up on 5.10 but after ATA bring up on 6.1.

This seems like a pretty serious change to me.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (26 preceding siblings ...)
  2023-07-10  2:51 ` bugzilla-daemon
@ 2023-07-10  3:48 ` bugzilla-daemon
  2023-07-10 17:00 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10  3:48 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #52 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Paul Ausbeck from comment #51)
> The PCIe BAR (Base address register) is usually GPU related as lspci
> verifies. BAR mapping failures are pretty routine on my older machines. My
> assessment is that they aren't relevant except to show that at least some
> GPU activity is deferred until after ATA bring up.
> 
> I think more important is that Restarting tasks ...  and even PM: suspend
> exit are happening before ATA bring up on 5.10 but after ATA bring up on 6.1.
> 
> This seems like a pretty serious change to me.

The "serious change" is a bug fix. "PM: suspend exit" sets the system out of
suspend mode but before commit 6aa0365a3c85, libata was "lying" about the state
of its devices. They were not really resumed and that was causing issues: some
users reported inability to use storage because of a deadlock between ata
resume and scsi resume. So we are not going back on that. This commit fixes a
serious issue and makes libata compliant with PM semantic.

I can try to revisit that fix to see if I can shorten the time to signaling a
"resumed" state for ata devices, but I doubt much is possible.

In any case, for now, having resume wait for the hdd spinup is safer than the
alternative. Your other issue about the GPU and mouse is I think different and
not related to libata resume. Not sure exactly what is happening. As I said, I
think it is worth signaling this to the pci, drm and or pm dev lists.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (27 preceding siblings ...)
  2023-07-10  3:48 ` bugzilla-daemon
@ 2023-07-10 17:00 ` bugzilla-daemon
  2023-07-10 22:48 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10 17:00 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #53 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
I sense that we are in disagreement. In my way of thinking disagreement is
overcome through logical argument. So here goes.

I argue that you can't call a bug fixed if the fix introduces a serious,
entirely separate, regression. I further argue that at resume the mouse pointer
should be interactive at the same time that the display becomes active. I
further argue that if the mouse pointer is not interactive at any time that the
display is active, that is a serious UX bug. I further argue that before the
recent kernel ATA changes this particular bug did not exist and therefore it is
a regression. I further argue that though the regression is UX related it
cannot be patched over outside of the kernel. I further argue that this
regression likely affects any personal machine that contains a spinning disk
and that this class of machines is substantial in size. I further argue that
your reported deadlock between ata resume and scsi resume likely affects a far
smaller class of machines. I further argue that though for a single machine
deadlock is a more important problem than temporary lack of mouse pointer
interactivity, for the kernel as a whole, problem importance is proportional to
the multiplication of the individual seriousness and the size of the set of
affected machines. I further argue that your assertion that the non-interactive
mouse pointer is not related to the libata resume changes is incorrect. I
further argue that the two dmesg fragments that I have posted to this thread
explain the situation quite clearly.

My argument contains 10 assertions. If you disagree with any of them, please
give details. In particular I don't have a clear idea of the size of the set of
machines that might experience the "fixed" ata/scsi resume deadlock.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (28 preceding siblings ...)
  2023-07-10 17:00 ` bugzilla-daemon
@ 2023-07-10 22:48 ` bugzilla-daemon
  2023-07-11 21:39 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-10 22:48 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #54 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Paul Ausbeck from comment #53)
> I sense that we are in disagreement. In my way of thinking disagreement is
> overcome through logical argument. So here goes.
> 
> I argue that you can't call a bug fixed if the fix introduces a serious,
> entirely separate, regression. I further argue that at resume the mouse
> pointer should be interactive at the same time that the display becomes
> active. I further argue that if the mouse pointer is not interactive at any
> time that the display is active, that is a serious UX bug. I further argue
> that before the recent kernel ATA changes this particular bug did not exist
> and therefore it is a regression. I further argue that though the regression
> is UX related it cannot be patched over outside of the kernel. I further
> argue that this regression likely affects any personal machine that contains
> a spinning disk and that this class of machines is substantial in size. I
> further argue that your reported deadlock between ata resume and scsi resume
> likely affects a far smaller class of machines. I further argue that though
> for a single machine deadlock is a more important problem than temporary
> lack of mouse pointer interactivity, for the kernel as a whole, problem
> importance is proportional to the multiplication of the individual
> seriousness and the size of the set of affected machines. I further argue
> that your assertion that the non-interactive mouse pointer is not related to
> the libata resume changes is incorrect. I further argue that the two dmesg
> fragments that I have posted to this thread explain the situation quite
> clearly.
> 
> My argument contains 10 assertions. If you disagree with any of them, please
> give details. In particular I don't have a clear idea of the size of the set
> of machines that might experience the "fixed" ata/scsi resume deadlock.

Repeating the same things again and again without new technical information we
(developers) can act on is not very productive. So let's dig into this issue to
verify *if* this is really an issue introduced by libata, which as I said, I
doubt it is. But let's make sure. So could you please try the following:
1) Build the latest 6.5-rc1 kernel from git and try with that kernel to see if
the issue is still there.
2) If the issue is not resolved, with that same kernel, please try to revert
commit 6aa0365a3c85.
3) Regardless of the result for (2) please do a bisect to see if the commit
responsible for the issue is indeed 6aa0365a3c85. You can find information on
how to do that in the Linux documentation at
Documentation/admin-guide/bug-bisect.rst. You may also want to read 
Documentation/admin-guide/quickly-build-trimmed-linux.rst
4) Please provide the output of lspci and a full dmesg for your system. dmesg
output after boot and after resume would be helpful.

With the above information, we will likely have a better idea of what is going
on with your machine.

Also please know that the number of machines affected by an issue only
determines the priority/urgency of a fix. Even if a single user is affected,
issues will still be addressed as long as enough information is provided and
the person affected helps with testing (as I cannot recreate that issue with
the hardware I have).

Note that I am on vacation this week, so I am only checking emails and replying
as a courtesy. I will not do anything until Tuesday next week.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (29 preceding siblings ...)
  2023-07-10 22:48 ` bugzilla-daemon
@ 2023-07-11 21:39 ` bugzilla-daemon
  2023-07-11 22:55 ` bugzilla-daemon
  2023-07-20 21:52 ` bugzilla-daemon
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-11 21:39 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #55 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
lspci output is relatively small and easy:

00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core processor
DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation IvyBridge GT2 [HD Graphics
4000] (rev 09)
00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset Family
USB xHCI Host Controller (rev 04)
00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset
Family MEI Controller #1 (rev 04)
00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network
Connection (rev 04)
00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB
Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High
Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI Express
Root Port 1 (rev c4)
00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family PCI
Express Root Port 8 (rev c4)
00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB
Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation H77 Express Chipset LPC Controller (rev
04)
00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset Family
6-port SATA Controller [AHCI mode] (rev 04)
00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus Controller
(rev 04)
01:00.0 Multimedia video controller: Conexant Systems, Inc. CX23887/8 PCIe
Broadcast Audio and Video Decoder with 3D Comb (rev 0f)
02:00.0 PCI bridge: Integrated Technology Express, Inc. IT8892E PCIe to PCI
Bridge (rev 30)
03:01.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000
Controller (PHY/Link)

I'll wait a bit for full dmesg output and/or bisecting. When I initially read
the bug report, ericspero seemed to do a pretty good job of bisecting the
change.

It seems that we still don't agree that libata is responsible for the newly
introduced lack of mouse pointer interactivity following resume. Therefore, I
will spend some more time characterizing the problem. I have a relatively new
Chuwi laptop with a spare m.2/sata slot. I've ordered an m.2/sata to sata 7 pin
adapter so that I can plug a sata hard drive into this machine. As far as I can
find, this type of adapter is only available directly from China so it will
take a couple of weeks to get them. I'll post the results then. If I can muster
to two machines as orthogonal as I can make them both exhibiting the same
problem, perhaps we might come together. Or perhaps the Chuwi machine won't
exhibit the problem and I'll have to rethink.

One other observation. It seems to me that libata maintainers should have a
machine containing a spinning disk at their disposal. I have a bunch of spare
machines with hard drives and would be amenable to donating one for this
purpose. Also, in case you are interested, the previously described m.2/sata to
7pin sata adapter is available on eBay at:

https://www.ebay.com/itm/394315621714?hash=item5bcf0ae552:g:YGYAAOSwAjRjYcUd&amdata=enc%3AAQAIAAAA8BpM7jdIQ4JS2IAPozAs7uLQETGPFQcybkuhO3nBfjImXm%2BMSA5oCG2CcetiX4fOY579FQd0JR8Pj81mS%2FW8k9MyCtUCg1hGuVhX4aMzkBQrzfXXjZcz%2F%2BldcLdwbLHcyTpN%2BjrSKmKiyGprAr1aZEH27ur%2FTJReKBBziJpJAz8qlG9j6GCGEYsA5R2kLpW0bSX4s8it5oZ1xTHDNhiidRnGvgTi16IvgMCZiLuGnOgTeSJuVA4kQdpsli2lYjmMwuU4i3USWppY5mh54l0AGV3Nf%2Ff4%2FCwhCoEogLghu62kFYpd7h3fofI8mCPTFLjh%2FA%3D%3D%7Ctkp%3ABk9SR7LBpuuoYg

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (30 preceding siblings ...)
  2023-07-11 21:39 ` bugzilla-daemon
@ 2023-07-11 22:55 ` bugzilla-daemon
  2023-07-20 21:52 ` bugzilla-daemon
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-11 22:55 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #56 from Damien Le Moal (damien.lemoal@wdc.com) ---
(In reply to Paul Ausbeck from comment #55)
> lspci output is relatively small and easy:
> 
> 00:00.0 Host bridge: Intel Corporation Xeon E3-1200 v2/3rd Gen Core
> processor DRAM Controller (rev 09)
> 00:02.0 VGA compatible controller: Intel Corporation IvyBridge GT2 [HD
> Graphics 4000] (rev 09)
> 00:14.0 USB controller: Intel Corporation 7 Series/C210 Series Chipset
> Family USB xHCI Host Controller (rev 04)
> 00:16.0 Communication controller: Intel Corporation 7 Series/C216 Chipset
> Family MEI Controller #1 (rev 04)
> 00:19.0 Ethernet controller: Intel Corporation 82579V Gigabit Network
> Connection (rev 04)
> 00:1a.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB
> Enhanced Host Controller #2 (rev 04)
> 00:1b.0 Audio device: Intel Corporation 7 Series/C216 Chipset Family High
> Definition Audio Controller (rev 04)
> 00:1c.0 PCI bridge: Intel Corporation 7 Series/C216 Chipset Family PCI
> Express Root Port 1 (rev c4)
> 00:1c.7 PCI bridge: Intel Corporation 7 Series/C210 Series Chipset Family
> PCI Express Root Port 8 (rev c4)
> 00:1d.0 USB controller: Intel Corporation 7 Series/C216 Chipset Family USB
> Enhanced Host Controller #1 (rev 04)
> 00:1f.0 ISA bridge: Intel Corporation H77 Express Chipset LPC Controller
> (rev 04)
> 00:1f.2 SATA controller: Intel Corporation 7 Series/C210 Series Chipset
> Family 6-port SATA Controller [AHCI mode] (rev 04)

I think I have some servers that have the same chipset. But being servers,
suspend/resume is not well supported. Can try suspend to disk at least, but not
suspend to RAM.

> 00:1f.3 SMBus: Intel Corporation 7 Series/C216 Chipset Family SMBus
> Controller (rev 04)
> 01:00.0 Multimedia video controller: Conexant Systems, Inc. CX23887/8 PCIe
> Broadcast Audio and Video Decoder with 3D Comb (rev 0f)
> 02:00.0 PCI bridge: Integrated Technology Express, Inc. IT8892E PCIe to PCI
> Bridge (rev 30)
> 03:01.0 FireWire (IEEE 1394): Texas Instruments TSB43AB23 IEEE-1394a-2000
> Controller (PHY/Link)
> 
> I'll wait a bit for full dmesg output and/or bisecting. When I initially
> read the bug report, ericspero seemed to do a pretty good job of bisecting
> the change.

You could try with the revert of 6aa0365a3c85. What I think you are seeing is
the X client start (Wayland ?) being delayed by the fact that with
6aa0365a3c85, resume waits for the HDD to be revalidated. Hence the delay for X
& mouse to start working.

> It seems that we still don't agree that libata is responsible for the newly
> introduced lack of mouse pointer interactivity following resume. Therefore,

See above. It could be. But permanently reverting 6aa0365a3c85 is not a
solution as that commit fixes a real problem that some users reported and that
you could also hit yourself by reverting it. As I said, if you confirm that
this commit is indeed the cause of the delay, I can try to see how to reduce
the delay, if possible. The main issue is that the resume code is a bit of a
mess between libata and scsi, and libata resume is mostly done using EH (error
handler) context, which when running prevents using the drive for anything. A
rework of the resume path may be needed but that is extremely dangerous as that
could introduce lots of regressions (because libata is in a sense the
accumulation of decades of "magic" code to deal with buggy adapters and drives,
and there are a lot out these out there). So extreme caution is needed when
touching such code.

> I will spend some more time characterizing the problem. I have a relatively
> new Chuwi laptop with a spare m.2/sata slot. I've ordered an m.2/sata to
> sata 7 pin adapter so that I can plug a sata hard drive into this machine.
> As far as I can find, this type of adapter is only available directly from
> China so it will take a couple of weeks to get them. I'll post the results
> then. If I can muster to two machines as orthogonal as I can make them both
> exhibiting the same problem, perhaps we might come together. Or perhaps the
> Chuwi machine won't exhibit the problem and I'll have to rethink.
> 
> One other observation. It seems to me that libata maintainers should have a
> machine containing a spinning disk at their disposal. I have a bunch of
> spare machines with hard drives and would be amenable to donating one for
> this purpose. Also, in case you are interested, the previously described
> m.2/sata to 7pin sata adapter is available on eBay at:

I am the libata maintainer, and if you check my email address, you will see
that it is not hard for me to get SATA HDDs and SSDs. I have 3 racks of machine
in my lab with literally hundreds of HDDs and SSDs usable, and also some PCs
with CD/DVD, old parallel ATA drives, port multiplier enclosures, etc. Many
favors of hardware.

But that does not mean that I can reproduce all issues notified by users. I
could never reproduce the issue that commit 6aa0365a3c85 fixes so my tests only
confirmed that I was not seeing any regression (and I did not see a bad delay
with anything as I do not run any GUI on my machines).

Rather than hardware, which I have plenty, I would be happier with people
spending time testing for-next and RC trees to verify that patches do not have
unintended side effects. libata is not the sexiest of subsystems and it is hard
to get people to review and test patches.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16
       [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
                   ` (31 preceding siblings ...)
  2023-07-11 22:55 ` bugzilla-daemon
@ 2023-07-20 21:52 ` bugzilla-daemon
  32 siblings, 0 replies; 34+ messages in thread
From: bugzilla-daemon @ 2023-07-20 21:52 UTC (permalink / raw)
  To: linux-scsi

https://bugzilla.kernel.org/show_bug.cgi?id=215880

--- Comment #57 from Paul Ausbeck (paula@alumni.cse.ucsc.edu) ---
The m.2/SATA to 7pin SATA adapter that I mentioned previously as on order,
arrived. Using this adapter I connected a spare Barracuda 4TB drive to one of
the m.2 slots on my Chuwi Gemibook Pro laptop. With this setup, the lack of
mouse pointer interactivity on resume exhibits exactly the same as on my Intel
dh77kc desktop.

The mouse pointer is not interactive until after "PM: suspend exit" is written
to the kernel log. Additionally, an active GUI terminal is not interactive
until the same time. This is noticeable by typing "sudo pm-suspend" in a
terminal. The resulting CR/LF and next command prompt do not appear following
resume until the same time that the mouse cursor comes alive. The Barracuda
drive connected to the Gemibook isn't as slow to spin up as the relatively old
WD 2TB in my desktop so the delay to interactivity is only about 5 seconds
instead of 10+.

The interactivity delay is present on both machines when using kernel 6.1x and
not present on both machines when using kernel 5.10x. When using kernel 5.10x,
the dmesg log on both machines shows "PM suspend exit" happens before the
slower ATA drives are "configured for UDMA/133". When using kernel 6.1x, the
dmesg log shows "PM suspend exit" occurring after the slower ATA dries are
"configured for UDMA/133".

At this point, I am convinced that this is a regression related to changes in
libata. I will take this offline now to see what I can do to help recover the
previous resume interactivity behavior.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2023-07-20 21:52 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <bug-215880-11613@https.bugzilla.kernel.org/>
2022-06-27  4:09 ` [Bug 215880] Resume process hangs for 5-6 seconds starting sometime in 5.16 bugzilla-daemon
2022-06-28 21:17 ` bugzilla-daemon
2022-06-28 22:30 ` bugzilla-daemon
2022-07-15 18:44 ` bugzilla-daemon
2022-08-15 11:02 ` bugzilla-daemon
2022-08-15 13:36 ` bugzilla-daemon
2022-08-16 11:08 ` bugzilla-daemon
2022-08-16 15:44 ` bugzilla-daemon
2022-08-16 16:10 ` bugzilla-daemon
2022-08-16 16:14 ` bugzilla-daemon
2022-08-16 17:11 ` bugzilla-daemon
2022-08-25 20:01 ` bugzilla-daemon
2022-08-25 20:31 ` bugzilla-daemon
2022-08-25 22:15 ` bugzilla-daemon
2022-08-26  7:00   ` Damien Le Moal
2022-08-26  7:00 ` bugzilla-daemon
2022-10-06 15:37 ` bugzilla-daemon
2022-10-06 23:48 ` bugzilla-daemon
2022-10-07  0:10 ` bugzilla-daemon
2022-10-07  0:15 ` bugzilla-daemon
2023-07-08 23:17 ` bugzilla-daemon
2023-07-09  7:09 ` bugzilla-daemon
2023-07-09 23:18 ` bugzilla-daemon
2023-07-10  0:18 ` bugzilla-daemon
2023-07-10  0:47 ` bugzilla-daemon
2023-07-10  1:02 ` bugzilla-daemon
2023-07-10  1:07 ` bugzilla-daemon
2023-07-10  2:51 ` bugzilla-daemon
2023-07-10  3:48 ` bugzilla-daemon
2023-07-10 17:00 ` bugzilla-daemon
2023-07-10 22:48 ` bugzilla-daemon
2023-07-11 21:39 ` bugzilla-daemon
2023-07-11 22:55 ` bugzilla-daemon
2023-07-20 21:52 ` bugzilla-daemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).