linux-next.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
@ 2022-11-07 17:28 Gerd Bayer
  2022-11-08  3:50 ` Chaitanya Kulkarni
  2022-11-08  7:48 ` Christoph Hellwig
  0 siblings, 2 replies; 9+ messages in thread
From: Gerd Bayer @ 2022-11-07 17:28 UTC (permalink / raw)
  To: Jens Axboe, Christoph Hellwig, Sagi Grimberg
  Cc: Niklas Schnelle, linux-kernel, linux-next, linux-nvme

Hi,

our internal s390 CI pointed us to a potential racy "use after free" or similar 
issue in drivers/nvme/host/pci.c by ending one of the tests in the following 
kernel panic:

[ 1836.550881] nvme nvme0: pci function 0004:00:00.0
[ 1836.563814] nvme nvme0: Shutdown timeout set to 15 seconds
[ 1836.569587] nvme nvme0: 63/0/0 default/read/poll queues
[ 1836.577114]  nvme0n1: p1 p2
[ 1861.856726] nvme nvme0: pci function 0004:00:00.0
[ 1861.869539] nvme nvme0: failed to mark controller CONNECTING
[ 1861.869542] nvme nvme0: Removing after probe failure status: -16
[ 1861.869552] Unable to handle kernel pointer dereference in virtual kernel address space
[ 1861.869554] Failing address: 0000000000000000 TEID: 0000000000000483
[ 1861.869555] Fault in home space mode while using kernel ASCE.
[ 1861.869558] AS:0000000135c4c007 R3:00000003fffe0007 S:00000003fffe6000 P:000000000000013d 
[ 1861.869587] Oops: 0004 ilc:3 [#1] SMP 
[ 1861.869591] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4
nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables
nfnetlink mlx5_ib ib_uverbs uvdevice s390_trng ib_core vfio_ccw mdev vfio_iommu_type1 eadm_sch
 vfio sch_fq_codel configfs dm_service_time mlx5_core ghash_s390 prng chacha_s390 libchacha aes_s390 des_s390 libdes
sha3_512_s390 sha3_256_s390 sha512_s390 sha256_s390 sha1_s390 nvme sha_common nvme_core zfcp scsi_transport_fc
dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua dm_mirror dm_region_hash dm_log pkey zcry
pt rng_core autofs4
[ 1861.869627] CPU: 4 PID: 2929 Comm: kworker/u800:0 Not tainted 6.1.0-rc3-next-20221104 #4
[ 1861.869630] Hardware name: IBM 3931 A01 701 (LPAR)
[ 1861.869631] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
[ 1861.869637] Krnl PSW : 0704c00180000000 0000000134f026d0 (mutex_lock+0x10/0x28)
[ 1861.869643]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[ 1861.869646] Krnl GPRS: 0000000001000000 0000000000000000 0000000000000078 00000000a5f8c200
[ 1861.869648]            000003800309601c 0000000000000004 0000000000000000 0000000088e64220
[ 1861.869650]            0000000000000078 0000000000000000 0000000000000098 0000000088e64000
[ 1861.869651]            00000000a5f8c200 0000000088e641e0 00000001349bdac2 0000038003ea7c20
[ 1861.869658] Krnl Code: 0000000134f026c0: c0040008cfb8        brcl    0,000000013501c630
[ 1861.869658]            0000000134f026c6: a7190000            lghi    %r1,0
[ 1861.869658]           #0000000134f026ca: e33003400004        lg      %r3,832
[ 1861.869658]           >0000000134f026d0: eb1320000030        csg     %r1,%r3,0(%r2)
[ 1861.869658]            0000000134f026d6: ec160006007c        cgij    %r1,0,6,0000000134f026e2
[ 1861.869658]            0000000134f026dc: 07fe                bcr     15,%r14
[ 1861.869658]            0000000134f026de: 47000700            bc      0,1792
[ 1861.869658]            0000000134f026e2: c0f4ffffffe7        brcl    15,0000000134f026b0
[ 1861.869715] Call Trace:
[ 1861.869716]  [<0000000134f026d0>] mutex_lock+0x10/0x28 
[ 1861.869719]  [<000003ff7fc381d6>] nvme_dev_disable+0x1b6/0x2b0 [nvme] 
[ 1861.869722]  [<000003ff7fc3929e>] nvme_reset_work+0x49e/0x6a0 [nvme] 
[ 1861.869724]  [<0000000134309158>] process_one_work+0x200/0x458 
[ 1861.869730]  [<00000001343098e6>] worker_thread+0x66/0x480 
[ 1861.869732]  [<0000000134312888>] kthread+0x108/0x110 
[ 1861.869735]  [<0000000134297354>] __ret_from_fork+0x3c/0x58 
[ 1861.869738]  [<0000000134f074ea>] ret_from_fork+0xa/0x40 
[ 1861.869740] Last Breaking-Event-Address:
[ 1861.869741]  [<00000001349bdabc>] blk_mq_quiesce_tagset+0x2c/0xc0
[ 1861.869747] Kernel panic - not syncing: Fatal exception: panic_on_oops

On a stock kernel from
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20221104
we have been able to reproduce this at will with
this small script 

#!/usr/bin/env bash

echo $1 > /sys/bus/pci/drivers/nvme/unbind
echo $1 > /sys/bus/pci/drivers/nvme/bind
echo 1 > /sys/bus/pci/devices/$1/remove

when filling in the NVMe drives' PCI identifier.

We believe this to be a race-condition somewhere, since this sequence does not produce the panic
when executed interactively.

Could this be linked to the recent (refactoring) work by Christoph Hellwig?
E.g. https://lore.kernel.org/all/20221101150050.3510-3-hch@lst.de/

Thank you,
Gerd Bayer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-07 17:28 nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next Gerd Bayer
@ 2022-11-08  3:50 ` Chaitanya Kulkarni
  2022-11-08 15:46   ` Keith Busch
  2022-11-08  7:48 ` Christoph Hellwig
  1 sibling, 1 reply; 9+ messages in thread
From: Chaitanya Kulkarni @ 2022-11-08  3:50 UTC (permalink / raw)
  To: Gerd Bayer, Jens Axboe, Christoph Hellwig, Sagi Grimberg
  Cc: Niklas Schnelle, linux-kernel, linux-next, linux-nvme

On 11/7/22 09:28, Gerd Bayer wrote:
> Hi,
> 
> our internal s390 CI pointed us to a potential racy "use after free" or similar
> issue in drivers/nvme/host/pci.c by ending one of the tests in the following
> kernel panic:
> 

Thanks a lot for reporting this ...

[...]
> 
> On a stock kernel from
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tag/?h=next-20221104
> we have been able to reproduce this at will with
> this small script
> 
> #!/usr/bin/env bash
> 
> echo $1 > /sys/bus/pci/drivers/nvme/unbind
> echo $1 > /sys/bus/pci/drivers/nvme/bind
> echo 1 > /sys/bus/pci/devices/$1/remove
> 
> when filling in the NVMe drives' PCI identifier.
> 

Can you please submit a blktests for this ?
so this will get tested by everyone at each release ?

> We believe this to be a race-condition somewhere, since this sequence does not produce the panic
> when executed interactively.
> 

You can try and bisect the code to point out at exact commit.

> Could this be linked to the recent (refactoring) work by Christoph Hellwig?
> E.g. https://lore.kernel.org/all/20221101150050.3510-3-hch@lst.de/
> 
> Thank you,
> Gerd Bayer
> 
> 

-ck


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-07 17:28 nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next Gerd Bayer
  2022-11-08  3:50 ` Chaitanya Kulkarni
@ 2022-11-08  7:48 ` Christoph Hellwig
  2022-11-08 17:11   ` Gerd Bayer
                     ` (2 more replies)
  1 sibling, 3 replies; 9+ messages in thread
From: Christoph Hellwig @ 2022-11-08  7:48 UTC (permalink / raw)
  To: Gerd Bayer
  Cc: Jens Axboe, Christoph Hellwig, Sagi Grimberg, Niklas Schnelle,
	linux-kernel, linux-next, linux-nvme

Below is the minimal fix.  I'll see if I sort out the mess that is
probe/reset failure vs ->remove a bit better, though.

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index f94b05c585cbc..577bacdcfee08 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
 
 void nvme_stop_queues(struct nvme_ctrl *ctrl)
 {
+	if (!ctrl->tagset)
+		return;
 	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
 		blk_mq_quiesce_tagset(ctrl->tagset);
 	else
@@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
 
 void nvme_start_queues(struct nvme_ctrl *ctrl)
 {
+	if (!ctrl->tagset)
+		return;
 	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
 		blk_mq_unquiesce_tagset(ctrl->tagset);
 }

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-08  3:50 ` Chaitanya Kulkarni
@ 2022-11-08 15:46   ` Keith Busch
  0 siblings, 0 replies; 9+ messages in thread
From: Keith Busch @ 2022-11-08 15:46 UTC (permalink / raw)
  To: Chaitanya Kulkarni
  Cc: Gerd Bayer, Jens Axboe, Christoph Hellwig, Sagi Grimberg,
	Niklas Schnelle, linux-kernel, linux-next, linux-nvme

On Tue, Nov 08, 2022 at 03:50:33AM +0000, Chaitanya Kulkarni wrote:
> On 11/7/22 09:28, Gerd Bayer wrote:
> 
> > We believe this to be a race-condition somewhere, since this sequence does not produce the panic
> > when executed interactively.
> > 
> 
> You can try and bisect the code to point out at exact commit.

Bisect doesn't work without a known good commit point.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-08  7:48 ` Christoph Hellwig
@ 2022-11-08 17:11   ` Gerd Bayer
  2022-11-08 17:23   ` Keith Busch
  2022-11-09  2:54   ` Sagi Grimberg
  2 siblings, 0 replies; 9+ messages in thread
From: Gerd Bayer @ 2022-11-08 17:11 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Jens Axboe, Sagi Grimberg, Niklas Schnelle, linux-kernel,
	linux-next, linux-nvme

Hi Christoph,

with your minimal fix

On Tue, 2022-11-08 at 08:48 +0100, Christoph Hellwig wrote:
> Below is the minimal fix.  I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>  
>  void nvme_stop_queues(struct nvme_ctrl *ctrl)
>  {
> +	if (!ctrl->tagset)
> +		return;
>  	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>  		blk_mq_quiesce_tagset(ctrl->tagset);
>  	else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>  
>  void nvme_start_queues(struct nvme_ctrl *ctrl)
>  {
> +	if (!ctrl->tagset)
> +		return;
>  	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>  		blk_mq_unquiesce_tagset(ctrl->tagset);
>  }

on next-20221108 the kernel does not crash any
more when I run the short test-script.

dmesg shows:
Nov 08 17:38:51 a46lp24.lnxne.boe kernel: nvme nvme0: pci function 0004:00:00.0
Nov 08 17:38:51 a46lp24.lnxne.boe kernel:
nvme nvme0: failed to mark controller CONNECTING
Nov 08 17:38:51 a46lp24.lnxne.boe kernel: nvme nvme0: Removing after
probe failure status: -16
Nov 08 17:38:52 a46lp24.lnxne.boe kernel: pci 0004:00:00.0: Removing from iommu group 0

while kernel remains up.
I can even do 
- rescan on the pci bus (to bring back the nvme drive), and
- run the test script
multiple times.

So from my point of view this band-aid is valuable to be incorporated while the larger 
overhaul in

https://lore.kernel.org/linux-nvme/20221108150252.2123727-1-hch@lst.de/
is out for review and test.

Feel free to add my
Tested-by: Gerd Bayer <gbayer@linux.ibm.com>

Thank you,
Gerd Bayer


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-08  7:48 ` Christoph Hellwig
  2022-11-08 17:11   ` Gerd Bayer
@ 2022-11-08 17:23   ` Keith Busch
  2022-11-09  6:20     ` Christoph Hellwig
  2022-11-09  2:54   ` Sagi Grimberg
  2 siblings, 1 reply; 9+ messages in thread
From: Keith Busch @ 2022-11-08 17:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Gerd Bayer, Jens Axboe, Sagi Grimberg, Niklas Schnelle,
	linux-kernel, linux-next, linux-nvme

On Tue, Nov 08, 2022 at 08:48:47AM +0100, Christoph Hellwig wrote:
> Below is the minimal fix.  I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.

This looks good to go. I vote apply for 6.1, and we should consider
merging your larger refactor for 6.2 after more test/review.

Reviewed-by: Keith Busch <kbusch@kernel.org>
 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>  
>  void nvme_stop_queues(struct nvme_ctrl *ctrl)
>  {
> +	if (!ctrl->tagset)
> +		return;
>  	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>  		blk_mq_quiesce_tagset(ctrl->tagset);
>  	else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>  
>  void nvme_start_queues(struct nvme_ctrl *ctrl)
>  {
> +	if (!ctrl->tagset)
> +		return;
>  	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>  		blk_mq_unquiesce_tagset(ctrl->tagset);
>  }
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-08  7:48 ` Christoph Hellwig
  2022-11-08 17:11   ` Gerd Bayer
  2022-11-08 17:23   ` Keith Busch
@ 2022-11-09  2:54   ` Sagi Grimberg
  2022-11-09  7:41     ` Chao Leng
  2 siblings, 1 reply; 9+ messages in thread
From: Sagi Grimberg @ 2022-11-09  2:54 UTC (permalink / raw)
  To: Christoph Hellwig, Gerd Bayer
  Cc: Jens Axboe, Niklas Schnelle, linux-kernel, linux-next, linux-nvme


> Below is the minimal fix.  I'll see if I sort out the mess that is
> probe/reset failure vs ->remove a bit better, though.
> 
> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
> index f94b05c585cbc..577bacdcfee08 100644
> --- a/drivers/nvme/host/core.c
> +++ b/drivers/nvme/host/core.c
> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>   
>   void nvme_stop_queues(struct nvme_ctrl *ctrl)
>   {
> +	if (!ctrl->tagset)
> +		return;
>   	if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>   		blk_mq_quiesce_tagset(ctrl->tagset);
>   	else
> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>   
>   void nvme_start_queues(struct nvme_ctrl *ctrl)
>   {
> +	if (!ctrl->tagset)
> +		return;
>   	if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>   		blk_mq_unquiesce_tagset(ctrl->tagset);
>   }

Can we do that in the pci driver and not here?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-08 17:23   ` Keith Busch
@ 2022-11-09  6:20     ` Christoph Hellwig
  0 siblings, 0 replies; 9+ messages in thread
From: Christoph Hellwig @ 2022-11-09  6:20 UTC (permalink / raw)
  To: Keith Busch
  Cc: Christoph Hellwig, Gerd Bayer, Jens Axboe, Sagi Grimberg,
	Niklas Schnelle, linux-kernel, linux-next, linux-nvme

On Tue, Nov 08, 2022 at 10:23:31AM -0700, Keith Busch wrote:
> This looks good to go. I vote apply for 6.1, and we should consider
> merging your larger refactor for 6.2 after more test/review.

The problem doesn't exist yet in 6.1, it is caused by the recent
quiesce rework.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next
  2022-11-09  2:54   ` Sagi Grimberg
@ 2022-11-09  7:41     ` Chao Leng
  0 siblings, 0 replies; 9+ messages in thread
From: Chao Leng @ 2022-11-09  7:41 UTC (permalink / raw)
  To: Sagi Grimberg, Christoph Hellwig, Gerd Bayer
  Cc: Jens Axboe, Niklas Schnelle, linux-kernel, linux-next, linux-nvme

The check "if (!ctrl->tagset)" is just reduce the probability.

The real reason is the race of probe and remove.
It is similar with TCP and RDMA transport.
Israel has tried to fix it.
The detail:
https://github.com/torvalds/linux/commit/ce1518139e6976cf19c133b555083354fdb629b8
Unfortunately, this patch was reverted.

If it is in the process of "probe", remove should not be called.
Maybe we can move pci_set_drvdata to the end of nvme_probe.
Of course, the removal may not take effect if it is in the process of "probe".
This is why the patch of Israel is reverted.

Perhaps the better option would be that "remove" wait for the "probe" to complete,
and then do the real remove.
This requires additional mechanism to implement this.

On 2022/11/9 10:54, Sagi Grimberg wrote:
> 
>> Below is the minimal fix.  I'll see if I sort out the mess that is
>> probe/reset failure vs ->remove a bit better, though.
>>
>> diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
>> index f94b05c585cbc..577bacdcfee08 100644
>> --- a/drivers/nvme/host/core.c
>> +++ b/drivers/nvme/host/core.c
>> @@ -5160,6 +5160,8 @@ EXPORT_SYMBOL_GPL(nvme_start_freeze);
>>   void nvme_stop_queues(struct nvme_ctrl *ctrl)
>>   {
>> +    if (!ctrl->tagset)
>> +        return;
>>       if (!test_and_set_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>           blk_mq_quiesce_tagset(ctrl->tagset);
>>       else
>> @@ -5169,6 +5171,8 @@ EXPORT_SYMBOL_GPL(nvme_stop_queues);
>>   void nvme_start_queues(struct nvme_ctrl *ctrl)
>>   {
>> +    if (!ctrl->tagset)
>> +        return;
>>       if (test_and_clear_bit(NVME_CTRL_STOPPED, &ctrl->flags))
>>           blk_mq_unquiesce_tagset(ctrl->tagset);
>>   }
> 
> Can we do that in the pci driver and not here?
> 
> .

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2022-11-09  7:41 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-07 17:28 nvme-pci: NULL pointer dereference in nvme_dev_disable() on linux-next Gerd Bayer
2022-11-08  3:50 ` Chaitanya Kulkarni
2022-11-08 15:46   ` Keith Busch
2022-11-08  7:48 ` Christoph Hellwig
2022-11-08 17:11   ` Gerd Bayer
2022-11-08 17:23   ` Keith Busch
2022-11-09  6:20     ` Christoph Hellwig
2022-11-09  2:54   ` Sagi Grimberg
2022-11-09  7:41     ` Chao Leng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).