[PATCH] block: fix NPE when resuming SCSI devices using blk-mq

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
@ 2018-07-13 13:29 Patrick Steinhardt
  2018-07-13 13:41 ` Ming Lei
  2018-07-25 18:13   ` Bart Van Assche
  0 siblings, 2 replies; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-13 13:29 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Patrick Steinhardt, linux-block, linux-kernel

When power management for SCSI is enabled and if a device uses blk-mq,
it is possible to trigger a `NULL` pointer exception when resuming that
device. The NPE is triggered when trying to dereference the `request_fn`
function pointer of the device's `request_queue`:

    __blk_run_queue_uncond:470
    __blk_run_queue:490
    blk_post_runtime_resume:3889
    sdev_runtime_resume:263
    scsi_runtime_resume:275

When the SCSI device is being allocated by `scsi_alloc_sdev`, the
device's request queue will either be initialized via
`scsi_mq_alloc_queue` or `scsi_old_alloc_queue`. But the `request_fn`
member of the request queue is in fact only being set in
`scsi_old_alloc_queue`, which will then later cause the mentioned NPE.

Fix the issue by checking whether the `request_fn` is set in
`__blk_run_queue_uncond`. In case it is unset, we'll silently return and
not try to invoke the callback, thus fixing the NPE.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
---

Since at least v4.14, I am easily able to trigger above NPE by
unplugging USB mass storage devices on my computer (Skylake, ASUS
Z170I) with CONFIG_SCSI_MQ_DEFAULT=y. The attached patch fixes
the issue, but keep in mind that this is my first patch, so the
proposed fix may not be appropriate at all. Feedback would be
highly appreciated.

 block/blk-core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index f84a9b7b6f5a..0a2041660cd9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -456,7 +456,7 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
 	lockdep_assert_held(q->queue_lock);
 	WARN_ON_ONCE(q->mq_ops);

-	if (unlikely(blk_queue_dead(q)))
+	if (unlikely(!q->request_fn) || unlikely(blk_queue_dead(q)))
 		return;

 	/*
-- 
2.18.0

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-13 13:29 [PATCH] block: fix NPE when resuming SCSI devices using blk-mq Patrick Steinhardt
@ 2018-07-13 13:41 ` Ming Lei
  2018-07-16 15:11   ` Patrick Steinhardt
  2018-07-29  9:41   ` Tomas Janousek
  2018-07-25 18:13   ` Bart Van Assche
  1 sibling, 2 replies; 15+ messages in thread
From: Ming Lei @ 2018-07-13 13:41 UTC (permalink / raw)
  To: Patrick Steinhardt; +Cc: Jens Axboe, linux-block, Linux Kernel Mailing List

On Fri, Jul 13, 2018 at 9:29 PM, Patrick Steinhardt <ps@pks.im> wrote:
> When power management for SCSI is enabled and if a device uses blk-mq,
> it is possible to trigger a `NULL` pointer exception when resuming that
> device. The NPE is triggered when trying to dereference the `request_fn`
> function pointer of the device's `request_queue`:
>
>     __blk_run_queue_uncond:470
>     __blk_run_queue:490
>     blk_post_runtime_resume:3889
>     sdev_runtime_resume:263
>     scsi_runtime_resume:275
>
> When the SCSI device is being allocated by `scsi_alloc_sdev`, the
> device's request queue will either be initialized via
> `scsi_mq_alloc_queue` or `scsi_old_alloc_queue`. But the `request_fn`
> member of the request queue is in fact only being set in
> `scsi_old_alloc_queue`, which will then later cause the mentioned NPE.
>
> Fix the issue by checking whether the `request_fn` is set in
> `__blk_run_queue_uncond`. In case it is unset, we'll silently return and
> not try to invoke the callback, thus fixing the NPE.
>
> Signed-off-by: Patrick Steinhardt <ps@pks.im>
> ---
>
> Since at least v4.14, I am easily able to trigger above NPE by
> unplugging USB mass storage devices on my computer (Skylake, ASUS
> Z170I) with CONFIG_SCSI_MQ_DEFAULT=y. The attached patch fixes
> the issue, but keep in mind that this is my first patch, so the
> proposed fix may not be appropriate at all. Feedback would be
> highly appreciated.
>
>  block/blk-core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index f84a9b7b6f5a..0a2041660cd9 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -456,7 +456,7 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
>         lockdep_assert_held(q->queue_lock);
>         WARN_ON_ONCE(q->mq_ops);
>
> -       if (unlikely(blk_queue_dead(q)))
> +       if (unlikely(!q->request_fn) || unlikely(blk_queue_dead(q)))
>                 return;
>

Now runtime PM is disabled for blk-mq/scsi_mq, not sure how this issue is
triggered on your machine.

Could you share the steps for reproducing this issue?

Thanks,
Ming Lei

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-13 13:41 ` Ming Lei
@ 2018-07-16 15:11   ` Patrick Steinhardt
  2018-07-29  9:41   ` Tomas Janousek
  1 sibling, 0 replies; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-16 15:11 UTC (permalink / raw)
  To: Ming Lei; +Cc: Jens Axboe, linux-block, Linux Kernel Mailing List


[-- Attachment #1.1: Type: text/plain, Size: 3114 bytes --]

On Fri, Jul 13, 2018 at 09:41:41PM +0800, Ming Lei wrote:
> On Fri, Jul 13, 2018 at 9:29 PM, Patrick Steinhardt <ps@pks.im> wrote:
> > When power management for SCSI is enabled and if a device uses blk-mq,
> > it is possible to trigger a `NULL` pointer exception when resuming that
> > device. The NPE is triggered when trying to dereference the `request_fn`
> > function pointer of the device's `request_queue`:
> >
> >     __blk_run_queue_uncond:470
> >     __blk_run_queue:490
> >     blk_post_runtime_resume:3889
> >     sdev_runtime_resume:263
> >     scsi_runtime_resume:275
> >
> > When the SCSI device is being allocated by `scsi_alloc_sdev`, the
> > device's request queue will either be initialized via
> > `scsi_mq_alloc_queue` or `scsi_old_alloc_queue`. But the `request_fn`
> > member of the request queue is in fact only being set in
> > `scsi_old_alloc_queue`, which will then later cause the mentioned NPE.
> >
> > Fix the issue by checking whether the `request_fn` is set in
> > `__blk_run_queue_uncond`. In case it is unset, we'll silently return and
> > not try to invoke the callback, thus fixing the NPE.
> >
> > Signed-off-by: Patrick Steinhardt <ps@pks.im>
> > ---
> >
> > Since at least v4.14, I am easily able to trigger above NPE by
> > unplugging USB mass storage devices on my computer (Skylake, ASUS
> > Z170I) with CONFIG_SCSI_MQ_DEFAULT=y. The attached patch fixes
> > the issue, but keep in mind that this is my first patch, so the
> > proposed fix may not be appropriate at all. Feedback would be
> > highly appreciated.
> >
> >  block/blk-core.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index f84a9b7b6f5a..0a2041660cd9 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -456,7 +456,7 @@ inline void __blk_run_queue_uncond(struct request_queue *q)
> >         lockdep_assert_held(q->queue_lock);
> >         WARN_ON_ONCE(q->mq_ops);
> >
> > -       if (unlikely(blk_queue_dead(q)))
> > +       if (unlikely(!q->request_fn) || unlikely(blk_queue_dead(q)))
> >                 return;
> >
> 
> Now runtime PM is disabled for blk-mq/scsi_mq, not sure how this issue is
> triggered on your machine.
> 
> Could you share the steps for reproducing this issue?

I bet that the issue stems from custom hotplugging scripts then,
which change the value of power/control. See the attachment of
this mail for all sysfs changes that are being performed after
plugging in the USB stick. Basically, the reproduction steps on
my machine are:

1. plug in USB stick (assumed to be /dev/sdb now)
2. wait a short amount of time
3. dd if=/dev/sdb of=/dev/null bs=4M
4. wait a short amount of time
5. unplug the USB stick, which immediately crashes the system

The issue isn't always reproducible, I think there is some
variance depending on how much time passes by at step 2 and/or 4.
It probably is related to the autosuspend delay. From my
experience the crash becomes more likely the longer I wait after
step 4.

Regards
Patrick

[-- Attachment #1.2: sysfs.log --]
[-- Type: text/plain, Size: 2655 bytes --]

/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/power/control -> on
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/scsi_host/host4/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/scsi_host/host4/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/scsi_device/4:0:0:0/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/bsg/4:0:0:0/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/scsi_device/4:0:0:0/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/bsg/4:0:0:0/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/power/control -> auto
/sys//devices/virtual/bdi/8:16/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/scsi_disk/4:0:0:0/power/autosuspend_delay_ms -> 60000
/sys//devices/virtual/bdi/8:16/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/scsi_disk/4:0:0:0/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/sdb2/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/sdb2/power/control -> auto
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/sdb1/power/autosuspend_delay_ms -> 60000
/sys//devices/pci0000:00/0000:00:14.0/usb2/2-2/2-2:1.0/host4/target4:0:0/4:0:0:0/block/sdb/sdb1/power/control -> auto

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-13 13:29 [PATCH] block: fix NPE when resuming SCSI devices using blk-mq Patrick Steinhardt
@ 2018-07-25 18:13   ` Bart Van Assche
  2018-07-25 18:13   ` Bart Van Assche
  1 sibling, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2018-07-25 18:13 UTC (permalink / raw)
  To: ps, axboe; +Cc: linux-kernel, linux-block, ming.lei

On Fri, 2018-07-13 at 15:29 +0200, Patrick Steinhardt wrote:
> When power management for SCSI is enabled and if a device uses blk-mq=
,
> it is possible to trigger a `NULL` pointer exception when res=
uming that
> device. The NPE is triggered when trying to dereference the `requ=
est_fn`
> function pointer of the device's `request_queue`:
>=20
>     __blk_run_queue_uncond:470
>     __blk_run_queue:490
>     blk_post_runtime_resume:3889
>     sdev_runtime_resume:263
>     scsi_runtime_resume:275
>=20
> When the SCSI device is being allocated by `scsi_alloc_sd=
ev`, the
> device's request queue will either be initialized via
> `scsi_mq_alloc_queue` or `scsi_old_al=
loc_queue`. But the `request_fn`
> member of the request queue is in fact only being set in
> `scsi_old_alloc_queue`, which will then later cau=
se the mentioned NPE.
>=20
> Fix the issue by checking whether the `request_fn` is set=
 in
> `__blk_run_queue_uncond`. In case it is unset,=
 we'll silently return and
> not try to invoke the callback, thus fixing the NPE.

Which kernel version are you using? Can you check whether the following two
commits are in your kernel tree?

* 4fd41a8552af ("SCSI: Fix NULL pointer dereference in runtime PM=
CIAOw- December
  2015).
* 765e40b675a9 ("block: disable runtime-pm for blk-mq"; July =
2017).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
@ 2018-07-25 18:13   ` Bart Van Assche
  0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2018-07-25 18:13 UTC (permalink / raw)
  To: ps, axboe; +Cc: linux-kernel, linux-block, ming.lei

On Fri, 2018-07-13 at 15:29 +0200, Patrick Steinhardt wrote:
> When power management for SCSI is enabled and if a device uses blk-mq,
> it is possible to trigger a `NULL` pointer exception when resuming that
> device. The NPE is triggered when trying to dereference the `request_fn`
> function pointer of the device's `request_queue`:
> 
>     __blk_run_queue_uncond:470
>     __blk_run_queue:490
>     blk_post_runtime_resume:3889
>     sdev_runtime_resume:263
>     scsi_runtime_resume:275
> 
> When the SCSI device is being allocated by `scsi_alloc_sdev`, the
> device's request queue will either be initialized via
> `scsi_mq_alloc_queue` or `scsi_old_alloc_queue`. But the `request_fn`
> member of the request queue is in fact only being set in
> `scsi_old_alloc_queue`, which will then later cause the mentioned NPE.
> 
> Fix the issue by checking whether the `request_fn` is set in
> `__blk_run_queue_uncond`. In case it is unset, we'll silently return and
> not try to invoke the callback, thus fixing the NPE.

Which kernel version are you using? Can you check whether the following two
commits are in your kernel tree?

* 4fd41a8552af ("SCSI: Fix NULL pointer dereference in runtime PM"; December
  2015).
* 765e40b675a9 ("block: disable runtime-pm for blk-mq"; July 2017).

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-25 18:13   ` Bart Van Assche
  (?)
@ 2018-07-26  8:38   ` Patrick Steinhardt
  2018-07-26 13:51     ` Bart Van Assche
  -1 siblings, 1 reply; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-26  8:38 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: axboe, linux-kernel, linux-block, ming.lei

[-- Attachment #1: Type: text/plain, Size: 2560 bytes --]

On Wed, Jul 25, 2018 at 06:13:02PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-13 at 15:29 +-0200, Patrick Steinhardt wrote:
> +AD4- When power management for SCSI is enabled and if a device uses blk-mq,
> +AD4- it is possible to trigger a +AGA-NULL+AGA- pointer exception when resuming that
> +AD4- device. The NPE is triggered when trying to dereference the +AGA-request+AF8-fn+AGA-
> +AD4- function pointer of the device's +AGA-request+AF8-queue+AGA-:
> +AD4- 
> +AD4-     +AF8AXw-blk+AF8-run+AF8-queue+AF8-uncond:470
> +AD4-     +AF8AXw-blk+AF8-run+AF8-queue:490
> +AD4-     blk+AF8-post+AF8-runtime+AF8-resume:3889
> +AD4-     sdev+AF8-runtime+AF8-resume:263
> +AD4-     scsi+AF8-runtime+AF8-resume:275
> +AD4- 
> +AD4- When the SCSI device is being allocated by +AGA-scsi+AF8-alloc+AF8-sdev+AGA-, the
> +AD4- device's request queue will either be initialized via
> +AD4- +AGA-scsi+AF8-mq+AF8-alloc+AF8-queue+AGA- or +AGA-scsi+AF8-old+AF8-alloc+AF8-queue+AGA-. But the +AGA-request+AF8-fn+AGA-
> +AD4- member of the request queue is in fact only being set in
> +AD4- +AGA-scsi+AF8-old+AF8-alloc+AF8-queue+AGA-, which will then later cause the mentioned NPE.
> +AD4- 
> +AD4- Fix the issue by checking whether the +AGA-request+AF8-fn+AGA- is set in
> +AD4- +AGAAXwBf-blk+AF8-run+AF8-queue+AF8-uncond+AGA-. In case it is unset, we'll silently return and
> +AD4- not try to invoke the callback, thus fixing the NPE.
> 
> Which kernel version are you using? Can you check whether the following two
> commits are in your kernel tree?
> 
> +ACo- 4fd41a8552af (+ACI-SCSI: Fix NULL pointer dereference in runtime PM+ACIAOw- December
>   2015).
> +ACo- 765e40b675a9 (+ACI-block: disable runtime-pm for blk-mq+ACIAOw- July 2017).

Commit 765e40b675a9 (block: disable runtime-pm for blk-mq July
2017) was indeed not part of my kernel. Back when I upgraded to
v4.14, suspend to RAM on my laptop was broken, and a bisect
showed that this particular commit was the culprit. So I reverted
it and forgot until it bit me now -- so it's been my own
stupidity. Guess it serves me right for not checking my own
changes.

That still leaves the other problem of broken suspend. I've just
checked with v4.17.10, and it's still there: as soon as I resume
from suspend, the kernel oopses and reboots the machine. I guess
I'll have to do another debugging session and see where it fails
exactly (and no, this time there are no more changes to the
kernel tree).

Anyway, thanks for pointing out my own mistakes.

Regards
Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-26  8:38   ` Patrick Steinhardt
@ 2018-07-26 13:51     ` Bart Van Assche
  2018-07-27 12:35       ` Patrick Steinhardt
  0 siblings, 1 reply; 15+ messages in thread
From: Bart Van Assche @ 2018-07-26 13:51 UTC (permalink / raw)
  To: ps; +Cc: linux-kernel, linux-block, axboe, ming.lei

On Thu, 2018-07-26 at 10:38 +0200, Patrick Steinhardt wrote:
> That still leaves the other problem of broken suspend. I've just
> checked with v4.17.10, and it's still there: as soon as I resume
> from suspend, the kernel oopses and reboots the machine. I guess
> I'll have to do another debugging session and see where it fails
> exactly (and no, this time there are no more changes to the
> kernel tree).

Can you share the kernel oops details before you start running a bisect?

Thanks,

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-26 13:51     ` Bart Van Assche
@ 2018-07-27 12:35       ` Patrick Steinhardt
  2018-07-27 15:03           ` Bart Van Assche
  0 siblings, 1 reply; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-27 12:35 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-block, axboe, ming.lei

[-- Attachment #1: Type: text/plain, Size: 4910 bytes --]

On Thu, Jul 26, 2018 at 01:51:48PM +0000, Bart Van Assche wrote:
> On Thu, 2018-07-26 at 10:38 +-0200, Patrick Steinhardt wrote:
> +AD4- That still leaves the other problem of broken suspend. I've just
> +AD4- checked with v4.17.10, and it's still there: as soon as I resume
> +AD4- from suspend, the kernel oopses and reboots the machine. I guess
> +AD4- I'll have to do another debugging session and see where it fails
> +AD4- exactly (and no, this time there are no more changes to the
> +AD4- kernel tree).
> 
> Can you share the kernel oops details before you start running a bisect?

I've already done the bisect a few months ago, which resulted in
commit 765e40b675a9 (block: disable runtime-pm for blk-mq July
2017). I've never been able to get the Oops, as the screen was
blank, but I tried a little harder with pstore today and finally
got hold of it. Please see below for the Oops and previous dmesg
entries. Kernel version was v4.17.10.

Thanks
Patrick

<6>[   49.088931] PM: suspend entry (deep)
<6>[   49.088933] PM: Syncing filesystems ... done.
<6>[   49.093066] Freezing user space processes ... (elapsed 0.001 seconds) done.
<6>[   49.094283] OOM killer disabled.
<6>[   49.094284] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
<6>[   49.095470] Suspending console(s) (use no_console_suspend to debug)
<6>[   50.350548] ACPI: EC: interrupt blocked
<6>[   50.383913] ACPI: Preparing to enter system sleep state S3
<6>[   50.397529] ACPI: EC: EC stopped
<6>[   50.397529] ACPI: EC: event blocked
<6>[   50.397530] PM: Saving platform NVS memory
<6>[   50.397605] Disabling non-boot CPUs ...
<6>[   50.414940] smpboot: CPU 1 is now offline
<6>[   50.445153] smpboot: CPU 2 is now offline
<6>[   50.471702] smpboot: CPU 3 is now offline
<6>[   50.473710] ACPI: Low-level resume complete
<6>[   50.473770] ACPI: EC: EC started
<6>[   50.473771] PM: Restoring platform NVS memory
<6>[   50.476149] Enabling non-boot CPUs ...
<6>[   50.476175] x86: Booting SMP configuration:
<6>[   50.476176] smpboot: Booting Node 0 Processor 1 APIC 0x2
<4>[   50.478031]  cache: parent cpu1 should not be sleeping
<6>[   50.478333] CPU1 is up
<6>[   50.478348] smpboot: Booting Node 0 Processor 2 APIC 0x1
<4>[   50.478950]  cache: parent cpu2 should not be sleeping
<6>[   50.479194] CPU2 is up
<6>[   50.479206] smpboot: Booting Node 0 Processor 3 APIC 0x3
<4>[   50.479824]  cache: parent cpu3 should not be sleeping
<6>[   50.480470] CPU3 is up
<6>[   50.490620] ACPI: Waking up from system sleep state S3
<6>[   53.334256] ACPI: EC: interrupt unblocked
<6>[   53.411349] ACPI: EC: event unblocked
<1>[   53.411700] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a8
<6>[   53.411705] PGD 0 P4D 0 
<4>[   53.411711] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC PTI
<4>[   53.411716] CPU: 3 PID: 1600 Comm: kworker/u8:48 Tainted: G     U            4.17.10 #9
<4>[   53.411717] Hardware name: Dell Inc. XPS 13 9343/0310JH, BIOS A15 01/23/2018
<4>[   53.411724] Workqueue: events_unbound async_run_entry_fn
<4>[   53.411733] RIP: 0010:blk_set_runtime_active+0x2d/0x50
<4>[   53.411736] RSP: 0000:ffff88020ef37df0 EFLAGS: 00010046
<4>[   53.411739] RAX: 0000000000000000 RBX: ffff880210a9a000 RCX: 0000000000000000
<4>[   53.411742] RDX: 00000000fffedf06 RSI: 0000000000000009 RDI: ffff880210a9a1b0
<4>[   53.411745] RBP: ffffffff815158d0 R08: ffffffff81c13ce0 R09: 0000000000000000
<4>[   53.411747] R10: 0000000000004400 R11: 0000000000005cd3 R12: ffff8802141671e8
<4>[   53.411749] R13: 0000000000000000 R14: ffff880215414c98 R15: 0000000000000000
<4>[   53.411753] FS:  0000000000000000(0000) GS:ffff88021e780000(0000) knlGS:0000000000000000
<4>[   53.411755] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[   53.411757] CR2: 00000000000001a8 CR3: 000000000300a003 CR4: 00000000003606e0
<4>[   53.411758] Call Trace:
<4>[   53.411766]  scsi_bus_resume_common+0xf8/0x110
<4>[   53.411771]  ? scsi_bus_thaw+0x10/0x10
<4>[   53.411776]  dpm_run_callback.isra.16+0x27/0x70
<4>[   53.411781]  device_resume+0xab/0x170
<4>[   53.411785]  async_resume+0x14/0x40
<4>[   53.411788]  async_run_entry_fn+0x34/0x100
<4>[   53.411793]  process_one_work+0x14d/0x2c0
<4>[   53.411797]  worker_thread+0x29/0x370
<4>[   53.411800]  ? process_one_work+0x2c0/0x2c0
<4>[   53.411803]  kthread+0x109/0x120
<4>[   53.411806]  ? __kthread_create_on_node+0x190/0x190
<4>[   53.411810]  ret_from_fork+0x35/0x40
<4>[   53.411813] Code: 89 fb 48 8b bf b8 01 00 00 e8 30 8c 4e 00 48 8b 83 40 02 00 00 be 09 00 00 00 c7 83 48 02 00 00 00 00 00 00 48 8b 15 03 31 cf 00 <48> 89 90 a8 01 00 00 48 8b bb 40 02 00 00 e8 00 1e 1e 00 48 8b 
<1>[   53.411873] RIP: blk_set_runtime_active+0x2d/0x50 RSP: ffff88020ef37df0
<4>[   53.411874] CR2: 00000000000001a8
<4>[   53.411877] ---[ end trace bbd10b6fcf31d6c4 ]---

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-27 12:35       ` Patrick Steinhardt
@ 2018-07-27 15:03           ` Bart Van Assche
  0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2018-07-27 15:03 UTC (permalink / raw)
  To: ps; +Cc: linux-kernel, linux-block, axboe, ming.lei

On Fri, 2018-07-27 at 14:35 +0200, Patrick Steinhardt wrote:
> <1>[   53.411700] BUG: unable to handle kernel NULL poi=
nter dereference at 00000000000001a8
> <4>[   53.411733] RIP: 0010:blk_set_runtime_=
-active+0x2d/0x50

Are you sure commit 4fd41a8552af ("SCSI: Fix NULL pointer dereference i=
n
runtime PM"; December 2015) is in your kernel tree?

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
@ 2018-07-27 15:03           ` Bart Van Assche
  0 siblings, 0 replies; 15+ messages in thread
From: Bart Van Assche @ 2018-07-27 15:03 UTC (permalink / raw)
  To: ps; +Cc: linux-kernel, linux-block, axboe, ming.lei

On Fri, 2018-07-27 at 14:35 +0200, Patrick Steinhardt wrote:
> <1>[   53.411700] BUG: unable to handle kernel NULL pointer dereference at 00000000000001a8
> <4>[   53.411733] RIP: 0010:blk_set_runtime_active+0x2d/0x50

Are you sure commit 4fd41a8552af ("SCSI: Fix NULL pointer dereference in
runtime PM"; December 2015) is in your kernel tree?

Bart.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-27 15:03           ` Bart Van Assche
  (?)
@ 2018-07-28 18:50           ` Patrick Steinhardt
  -1 siblings, 0 replies; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-28 18:50 UTC (permalink / raw)
  To: Bart Van Assche; +Cc: linux-kernel, linux-block, axboe, ming.lei

[-- Attachment #1: Type: text/plain, Size: 721 bytes --]

On Fri, Jul 27, 2018 at 03:03:09PM +0000, Bart Van Assche wrote:
> On Fri, 2018-07-27 at 14:35 +-0200, Patrick Steinhardt wrote:
> +AD4- +ADw-1+AD4AWw-   53.411700+AF0- BUG: unable to handle kernel NULL pointer dereference at 00000000000001a8
> +AD4- +ADw-4+AD4AWw-   53.411733+AF0- RIP: 0010:blk+AF8-set+AF8-runtime+AF8-active+-0x2d/0x50
> 
> Are you sure commit 4fd41a8552af (+ACI-SCSI: Fix NULL pointer dereference in
> runtime PM+ACIAOw- December 2015) is in your kernel tree?
> 
> Bart.

Yes, I am:

$ git describe
v4.17.10
$ git tag --contains 4fd41a8552af v4.17.10
v4.17.10

I've also manually verified that all changes from 4fd41a8552af
are contained in the sources, which they are.

Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-13 13:41 ` Ming Lei
  2018-07-16 15:11   ` Patrick Steinhardt
@ 2018-07-29  9:41   ` Tomas Janousek
  2018-07-30  7:59     ` Patrick Steinhardt
  1 sibling, 1 reply; 15+ messages in thread
From: Tomas Janousek @ 2018-07-29  9:41 UTC (permalink / raw)
  To: Ming Lei
  Cc: Patrick Steinhardt, Jens Axboe, linux-block, Linux Kernel Mailing List

Hi,

On Fri, Jul 13, 2018 at 09:41:41PM +0800, Ming Lei wrote:
> Now runtime PM is disabled for blk-mq/scsi_mq, not sure how this issue is
> triggered on your machine.

While Patrick did miss the following patch:

* 765e40b675a9 ("block: disable runtime-pm for blk-mq"; July 2017).

there is at least one other way to trigger it -- enable laptop-mode-tools
or tlp which enable runtime-pm for all devices.

The "disable runtime-pm for blk-mq" only disables it _by_default_, but doesn't
prevent it from being enabled again from user-space, which it is unless one
manually blacklists sd devices from runtime-pm enablement. It's bitten a few
people already: https://github.com/rickysarraf/laptop-mode-tools/issues/123

(I found this thread because I'm also getting the NULL pointer dereference at
00000000000001a8 on resume from suspend.)

-- 
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, http://work.lisk.in/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-29  9:41   ` Tomas Janousek
@ 2018-07-30  7:59     ` Patrick Steinhardt
  2018-07-30 11:50       ` Ming Lei
  0 siblings, 1 reply; 15+ messages in thread
From: Patrick Steinhardt @ 2018-07-30  7:59 UTC (permalink / raw)
  To: Tomas Janousek
  Cc: Ming Lei, Jens Axboe, linux-block, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1232 bytes --]

On Sun, Jul 29, 2018 at 11:41:31AM +0200, Tomas Janousek wrote:
> Hi,
> 
> On Fri, Jul 13, 2018 at 09:41:41PM +0800, Ming Lei wrote:
> > Now runtime PM is disabled for blk-mq/scsi_mq, not sure how this issue is
> > triggered on your machine.
> 
> While Patrick did miss the following patch:
> 
> * 765e40b675a9 ("block: disable runtime-pm for blk-mq"; July 2017).
> 
> there is at least one other way to trigger it -- enable laptop-mode-tools
> or tlp which enable runtime-pm for all devices.
> 
> The "disable runtime-pm for blk-mq" only disables it _by_default_, but doesn't
> prevent it from being enabled again from user-space, which it is unless one
> manually blacklists sd devices from runtime-pm enablement. It's bitten a few
> people already: https://github.com/rickysarraf/laptop-mode-tools/issues/123
> 
> (I found this thread because I'm also getting the NULL pointer dereference at
> 00000000000001a8 on resume from suspend.)

Huh, I did send out some more details on how I reproduce the
issue, but it seems like my mail didn't get through. While I
don't use laptop-mode-tools, I do have some custom hotplugging
scripts which do in fact enable runtime-PM for most devices.

Regards
Patrick

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-30  7:59     ` Patrick Steinhardt
@ 2018-07-30 11:50       ` Ming Lei
  2018-07-30 15:00         ` Tomas Janousek
  0 siblings, 1 reply; 15+ messages in thread
From: Ming Lei @ 2018-07-30 11:50 UTC (permalink / raw)
  To: Patrick Steinhardt
  Cc: Tomas Janousek, Ming Lei, Jens Axboe, linux-block,
	Linux Kernel Mailing List

On Mon, Jul 30, 2018 at 09:59:49AM +0200, Patrick Steinhardt wrote:
> On Sun, Jul 29, 2018 at 11:41:31AM +0200, Tomas Janousek wrote:
> > Hi,
> > 
> > On Fri, Jul 13, 2018 at 09:41:41PM +0800, Ming Lei wrote:
> > > Now runtime PM is disabled for blk-mq/scsi_mq, not sure how this issue is
> > > triggered on your machine.
> > 
> > While Patrick did miss the following patch:
> > 
> > * 765e40b675a9 ("block: disable runtime-pm for blk-mq"; July 2017).
> > 
> > there is at least one other way to trigger it -- enable laptop-mode-tools
> > or tlp which enable runtime-pm for all devices.
> > 
> > The "disable runtime-pm for blk-mq" only disables it _by_default_, but doesn't
> > prevent it from being enabled again from user-space, which it is unless one
> > manually blacklists sd devices from runtime-pm enablement. It's bitten a few
> > people already: https://github.com/rickysarraf/laptop-mode-tools/issues/123
> > 
> > (I found this thread because I'm also getting the NULL pointer dereference at
> > 00000000000001a8 on resume from suspend.)
> 
> Huh, I did send out some more details on how I reproduce the
> issue, but it seems like my mail didn't get through. While I
> don't use laptop-mode-tools, I do have some custom hotplugging
> scripts which do in fact enable runtime-PM for most devices.

Now runtime PM is still enabled for sd/sr, and I believe the following
patch is needed until we figure out one perfect way for supporting it
well:

diff --git a/block/blk-core.c b/block/blk-core.c
index 03a4ea93a5f3..090b782df129 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -3769,9 +3769,11 @@ EXPORT_SYMBOL(blk_finish_plug);
  */
 void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
 {
-	/* not support for RQF_PM and ->rpm_status in blk-mq yet */
-	if (q->mq_ops)
+	/* Don't enable runtime PM for blk-mq until it is ready */
+	if (q->mq_ops) {
+		pm_runtime_disable(dev);
 		return;
+	}
 
 	q->dev = dev;
 	q->rpm_status = RPM_ACTIVE;

Thanks,
Ming

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH] block: fix NPE when resuming SCSI devices using blk-mq
  2018-07-30 11:50       ` Ming Lei
@ 2018-07-30 15:00         ` Tomas Janousek
  0 siblings, 0 replies; 15+ messages in thread
From: Tomas Janousek @ 2018-07-30 15:00 UTC (permalink / raw)
  To: Ming Lei
  Cc: Patrick Steinhardt, Ming Lei, Jens Axboe, linux-block,
	Linux Kernel Mailing List

Hi Ming,

On Mon, Jul 30, 2018 at 07:50:03PM +0800, Ming Lei wrote:
> Now runtime PM is still enabled for sd/sr, and I believe the following
> patch is needed until we figure out one perfect way for supporting it
> well:
> [...]
> +		pm_runtime_disable(dev);

Awesome, thanks for the quick reaction! :-)

(I don't happen to have a .config for the hardware I'm experincing the problem
on, so don't expect a Tested-By from me, but I don't see why it wouldn't fix
the issue, and it did for Patrick.)

-- 
Tomáš Janoušek, a.k.a. Pivník, a.k.a. Liskni_si, http://work.lisk.in/

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-07-30 15:00 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-13 13:29 [PATCH] block: fix NPE when resuming SCSI devices using blk-mq Patrick Steinhardt
2018-07-13 13:41 ` Ming Lei
2018-07-16 15:11   ` Patrick Steinhardt
2018-07-29  9:41   ` Tomas Janousek
2018-07-30  7:59     ` Patrick Steinhardt
2018-07-30 11:50       ` Ming Lei
2018-07-30 15:00         ` Tomas Janousek
2018-07-25 18:13 ` Bart Van Assche
2018-07-25 18:13   ` Bart Van Assche
2018-07-26  8:38   ` Patrick Steinhardt
2018-07-26 13:51     ` Bart Van Assche
2018-07-27 12:35       ` Patrick Steinhardt
2018-07-27 15:03         ` Bart Van Assche
2018-07-27 15:03           ` Bart Van Assche
2018-07-28 18:50           ` Patrick Steinhardt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.