* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 21:38 ` IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again Keith Busch
@ 2018-04-05 21:22   ` Scott Bauer
  2018-04-05 22:21     ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Scott Bauer @ 2018-04-05 21:22 UTC (permalink / raw)


On Thu, Apr 05, 2018@03:38:47PM -0600, Keith Busch wrote:
> On Thu, Apr 05, 2018@03:51:38PM -0500, Alex G. wrote:
> > Hi Keith,
> > 
> > The NULL pointer dereference strikes yet again, but in a different
> > place. I think you'll love this one, as we can get it with native AER.
> > I'm not sure what to make of it, or why we get an invalid opcode with
> > the package, but the error is consistently tied to nvme.
> 
> Interesting indeed.
> 
> Invalid opcode is a BUG_ON triggering a kernel panic when it evaluates
> to true:
> 
>   [  938.971059] kernel BUG at mm/slub.c:296!
> 
> Which is this:
> 
>   static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>   {
> 	unsigned long freeptr_addr = (unsigned long)object + s->offset;
> 
>   #ifdef CONFIG_SLAB_FREELIST_HARDENED
> 	BUG_ON(object == fp); /* naive detection of double free or corruption */
>   #endif
> 
> 	*(void **)freeptr_addr = freelist_ptr(s, fp, freeptr_addr);
>   }
> 
> So the code thinks it's found memory corruption. Maybe it has.

Alex, are you able to build with KASAN? Assuming it is memory corruption KASAN can provide
us the location of the first free which may assist in debugging.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
       [not found] <5d6d1a8c-6490-4046-0fba-da0a0df3d00c@gmail.com>
@ 2018-04-05 21:38 ` Keith Busch
  2018-04-05 21:22   ` Scott Bauer
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-05 21:38 UTC (permalink / raw)


On Thu, Apr 05, 2018@03:51:38PM -0500, Alex G. wrote:
> Hi Keith,
> 
> The NULL pointer dereference strikes yet again, but in a different
> place. I think you'll love this one, as we can get it with native AER.
> I'm not sure what to make of it, or why we get an invalid opcode with
> the package, but the error is consistently tied to nvme.

Interesting indeed.

Invalid opcode is a BUG_ON triggering a kernel panic when it evaluates
to true:

  [  938.971059] kernel BUG at mm/slub.c:296!

Which is this:

  static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
  {
	unsigned long freeptr_addr = (unsigned long)object + s->offset;

  #ifdef CONFIG_SLAB_FREELIST_HARDENED
	BUG_ON(object == fp); /* naive detection of double free or corruption */
  #endif

	*(void **)freeptr_addr = freelist_ptr(s, fp, freeptr_addr);
  }

So the code thinks it's found memory corruption. Maybe it has.
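
To make the failure mode concrete: with CONFIG_SLAB_FREELIST_HARDENED=y, the
simplest way to trip that BUG_ON is an immediate double free. A minimal,
purely illustrative snippet (not taken from this report):

  /* kernel-module context; needs <linux/slab.h> */
  static void demo_double_free(void)
  {
	void *obj = kmalloc(64, GFP_KERNEL);

	kfree(obj);	/* obj becomes the head of the per-cpu freelist */
	kfree(obj);	/* set_freepointer() now sees object == fp and the
			 * BUG_ON fires -> "invalid opcode", kernel BUG */
  }

Freelist corruption from some other bug can trip the same check, which is
why its comment calls it a "naive detection of double free or corruption".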

At least one odd thing is the repeated "controller is down; will reset"
messages. This should come into play only if an IO timeout occurs, and
observing the message should mean the driver aborted and completed all
outstanding IO. There shouldn't be any more IO to timeout after that,
so not sure how we're repeatedly entering the timeout handler.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 21:22   ` Scott Bauer
@ 2018-04-05 22:21     ` Alex G.
  2018-04-05 22:41       ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-05 22:21 UTC (permalink / raw)


On 04/05/2018 04:22 PM, Scott Bauer wrote:
> On Thu, Apr 05, 2018@03:38:47PM -0600, Keith Busch wrote:
>> On Thu, Apr 05, 2018@03:51:38PM -0500, Alex G. wrote:
>>> Hi Keith,
>>>
>>> The NULL pointer dereference strikes yet again, but in a different
>>> place. I think you'll love this one, as we can get it with native AER.
>>> I'm not sure what to make of it, or why we get an invalid opcode with
>>> the package, but the error is consistently tied to nvme.
>>
>> Interesting indeed.
>>
>> Invalid opcode is a BUG_ON triggering a kernel panic when it evaluates
>> to true:
>>
>>   [  938.971059] kernel BUG at mm/slub.c:296!
>>
>> Which is this:
>>
>>   static inline void set_freepointer(struct kmem_cache *s, void *object, void *fp)
>>   {
>> 	unsigned long freeptr_addr = (unsigned long)object + s->offset;
>>
>>   #ifdef CONFIG_SLAB_FREELIST_HARDENED
>> 	BUG_ON(object == fp); /* naive detection of double free or corruption */
>>   #endif
>>
>> 	*(void **)freeptr_addr = freelist_ptr(s, fp, freeptr_addr);
>>   }
>>
>> So the code thinks it's found memory corruption. Maybe it has.
> 
> Alex, are you able to build with KASAN? Assuming it is memory corruption KASAN can provide
> us the location of the first free which may assist in debugging.
> 

All you have to do is say CONFIG_KASAN=y. It took almost no time at all
to trigger. The serial port is still stuck spewing out the logs, but the
ssh logger has them.

I've had to put the full log somewhere else[1], as it's way too big for
an email.

Alex
[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180405-1705.log

-------------- next part --------------
A non-text attachment was scrubbed...
Name: log-20180405-1705-trimmed.log
Type: text/x-log
Size: 99000 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20180405/0943fa3d/attachment-0001.bin>


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 22:21     ` Alex G.
@ 2018-04-05 22:41       ` Keith Busch
  2018-04-05 22:48         ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-05 22:41 UTC (permalink / raw)


On Thu, Apr 05, 2018@05:21:29PM -0500, Alex G. wrote:
> All you have to do is say CONFIG_KASAN=y. It took almost no time at all
> to trigger. The serial port is still stuck spewing out the logs, but the
> ssh logger has them.
> 
> I've had to put the full log somewhere else[1], as it's way too big for
> an email.
> 
> Alex
> http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180405-1705.log

I take it this doesn't happen without the MD raid. Could you try with
the following patch?

---
diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
index bf0b840645cc..d377643b5d06 100644
--- a/drivers/md/dm-rq.c
+++ b/drivers/md/dm-rq.c
@@ -366,7 +366,7 @@ static void dm_complete_request(struct request *rq, blk_status_t error)
 	if (!rq->q->mq_ops)
 		blk_complete_request(rq);
 	else
-		blk_mq_complete_request(rq);
+		blk_mq_end_request(rq);
 }
 
 /*
--


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 22:41       ` Keith Busch
@ 2018-04-05 22:48         ` Keith Busch
  2018-04-05 23:05           ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-05 22:48 UTC (permalink / raw)


On Thu, Apr 05, 2018@04:41:39PM -0600, Keith Busch wrote:
> I take it this doesn't happen without the MD raid. Could you try with
> the following patch?
> 
> ---
> diff --git a/drivers/md/dm-rq.c b/drivers/md/dm-rq.c
> index bf0b840645cc..d377643b5d06 100644
> --- a/drivers/md/dm-rq.c
> +++ b/drivers/md/dm-rq.c
> @@ -366,7 +366,7 @@ static void dm_complete_request(struct request *rq, blk_status_t error)
>  	if (!rq->q->mq_ops)
>  		blk_complete_request(rq);
>  	else
> -		blk_mq_complete_request(rq);
> +		blk_mq_end_request(rq);
>  }
>  
>  /*
> --

Actually, scratch this for a moment. I don't know this path well enough
to suggest the above.

What I really want is for a failed queue_rq to not call the nvme
driver's softirq_done for that request, and it looks like that may be
happening. I need to stare at this a little longer.
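
Roughly, the path in question looks like this (a sketch from the 4.16-era
sources; the exact call chain is my reading, not verified against this
kernel):

  /*
   * 1) nvme_queue_rq() hits the dead-queue check (cq_vector < 0), frees the
   *    iod via nvme_free_iod() and returns BLK_STS_IOERR.
   *
   * 2) If the normal completion still runs for that same request:
   *
   *        blk_mq_complete_request(rq)
   *          -> mq_ops->complete == nvme_pci_complete_rq()   [softirq_done]
   *            -> nvme_unmap_data() -> nvme_free_iod()
   *
   *    the iod/sg resources are unmapped and freed a second time, which is
   *    the double-free signature KASAN reports.
   */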


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 22:48         ` Keith Busch
@ 2018-04-05 23:05           ` Keith Busch
  2018-04-05 23:39             ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-05 23:05 UTC (permalink / raw)


Just trying to confirm a suspicion, could you retry with the following?

---
diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index b6f43b738f03..f9847a9f2973 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -890,12 +890,23 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
 	if (unlikely(nvmeq->cq_vector < 0)) {
 		ret = BLK_STS_IOERR;
 		spin_unlock_irq(&nvmeq->q_lock);
-		goto out_cleanup_iod;
+		goto out_unmap_iod;
 	}
 	__nvme_submit_cmd(nvmeq, &cmnd);
 	nvme_process_cq(nvmeq);
 	spin_unlock_irq(&nvmeq->q_lock);
 	return BLK_STS_OK;
+
+out_unmap_iod:
+	nvme_unmap_data(dev, req);
+	{
+		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
+		iod->nents = 0;
+		iod->sg = NULL;
+		iod->npages = -1;
+	}
+	return ret;
+
 out_cleanup_iod:
 	nvme_free_iod(dev, req);
 out_free_cmd:
--


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 23:05           ` Keith Busch
@ 2018-04-05 23:39             ` Alex G.
  2018-04-05 23:44               ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-05 23:39 UTC (permalink / raw)


On 04/05/2018 06:05 PM, Keith Busch wrote:
> Just trying to confirm a suspicion, could you retry with the following?

Started the test. Will let you know tomorrow of result.

Alex
> ---
> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
> index b6f43b738f03..f9847a9f2973 100644
> --- a/drivers/nvme/host/pci.c
> +++ b/drivers/nvme/host/pci.c
> @@ -890,12 +890,23 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>  	if (unlikely(nvmeq->cq_vector < 0)) {
>  		ret = BLK_STS_IOERR;
>  		spin_unlock_irq(&nvmeq->q_lock);
> -		goto out_cleanup_iod;
> +		goto out_unmap_iod;
>  	}
>  	__nvme_submit_cmd(nvmeq, &cmnd);
>  	nvme_process_cq(nvmeq);
>  	spin_unlock_irq(&nvmeq->q_lock);
>  	return BLK_STS_OK;
> +
> +out_unmap_iod:
> +	nvme_unmap_data(dev, req);
> +	{
> +		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
> +		iod->nents = 0;
> +		iod->sg = NULL;
> +		iod->npages = -1;
> +	}
> +	return ret;
> +
>  out_cleanup_iod:
>  	nvme_free_iod(dev, req);
>  out_free_cmd:
> --
> 


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 23:39             ` Alex G.
@ 2018-04-05 23:44               ` Alex G.
  2018-04-06 15:32                 ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-05 23:44 UTC (permalink / raw)


On 04/05/2018 06:39 PM, Alex G. wrote:
> On 04/05/2018 06:05 PM, Keith Busch wrote:
>> Just trying to confirm a suspicion, could you retry with the following?
> 
> Started the test. Will let you know tomorrow of result.

Actually, it crashed very fast [1]

[1] http://gtech.myftp.org/~mrnuke/nvme_logs/log-20180405-1838.log

> 
> Alex
>> ---
>> diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
>> index b6f43b738f03..f9847a9f2973 100644
>> --- a/drivers/nvme/host/pci.c
>> +++ b/drivers/nvme/host/pci.c
>> @@ -890,12 +890,23 @@ static blk_status_t nvme_queue_rq(struct blk_mq_hw_ctx *hctx,
>>  	if (unlikely(nvmeq->cq_vector < 0)) {
>>  		ret = BLK_STS_IOERR;
>>  		spin_unlock_irq(&nvmeq->q_lock);
>> -		goto out_cleanup_iod;
>> +		goto out_unmap_iod;
>>  	}
>>  	__nvme_submit_cmd(nvmeq, &cmnd);
>>  	nvme_process_cq(nvmeq);
>>  	spin_unlock_irq(&nvmeq->q_lock);
>>  	return BLK_STS_OK;
>> +
>> +out_unmap_iod:
>> +	nvme_unmap_data(dev, req);
>> +	{
>> +		struct nvme_iod *iod = blk_mq_rq_to_pdu(req);
>> +		iod->nents = 0;
>> +		iod->sg = NULL;
>> +		iod->npages = -1;
>> +	}
>> +	return ret;
>> +
>>  out_cleanup_iod:
>>  	nvme_free_iod(dev, req);
>>  out_free_cmd:
>> --
>>


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-05 23:44               ` Alex G.
@ 2018-04-06 15:32                 ` Keith Busch
  2018-04-06 15:46                   ` Alex G.
       [not found]                   ` <94d77cb7-759f-595a-2264-37305dfa96c4@gmail.com>
  0 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2018-04-06 15:32 UTC (permalink / raw)


On Thu, Apr 05, 2018@06:44:21PM -0500, Alex G. wrote:
> Actually, it crashed very fast [1]

Could you possibly do an experiment for this? Does this happen if you're
not using MD RAID and disable NVME_MULTIPATH?


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 15:32                 ` Keith Busch
@ 2018-04-06 15:46                   ` Alex G.
       [not found]                   ` <94d77cb7-759f-595a-2264-37305dfa96c4@gmail.com>
  1 sibling, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-04-06 15:46 UTC (permalink / raw)


On 04/06/2018 10:32 AM, Keith Busch wrote:
> On Thu, Apr 05, 2018@06:44:21PM -0500, Alex G. wrote:
>> Actually, it crashed very fast [1]
> 
> Could you possibly do an experiment for this? Does this happen if you're
> not using MD RAID and disable NVME_MULTIPATH?

I can, though the whole point of the lvm mirror is to cause IO to the
drive after hot-insert. Without that IO, there's a much smaller window
to get PCIe errors on SURPRISE!!! removal.

Alex


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
       [not found]                   ` <94d77cb7-759f-595a-2264-37305dfa96c4@gmail.com>
@ 2018-04-06 17:16                     ` Scott Bauer
  2018-04-06 17:46                       ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Scott Bauer @ 2018-04-06 17:16 UTC (permalink / raw)


On Fri, Apr 06, 2018@12:11:48PM -0500, Alex G. wrote:
> On 04/06/2018 10:32 AM, Keith Busch wrote:
> > On Thu, Apr 05, 2018@06:44:21PM -0500, Alex G. wrote:
> >> Actually, it crashed very fast [1]
> > 
> > Could you possibly do an experiment for this? Does this happen if you're
> > not using MD RAID and disable NVME_MULTIPATH?
> 
> Okay, no md-raid (or lvm mirror), NVME_MULTIPATH not set, and we still
> get a use-after-free (log attached). Could this be a race condition
> between nvme causing an unload and AER recovery?



You're using AER inject, right?

Two things with aer inject:

aer_inject.c doesn't grab pci_lock_rescan_remove() when it calls
pci_get_domain_bus_and_slot(), so theoretically we could already be inside
pciehp_unconfigure_device() with the rescan lock held. aer_inject will get
the pci_dev via pci_get_domain_bus_and_slot(), which does a pci_dev_get().

pciehp_unconfigure_device() will start iterating over the devices, calling
pci_dev_get(), so we now have 2 references on the device. aer_inject then
calls the AER irq handler, aer_irq(), which delegates the work off to a work
queue, so we're unblocked and kick back to aer_inject(), where at the end we
do a pci_dev_put().

Now back in pciehp_unconfigure_device(), it does a pci_dev_put() as well and
our refcount drops to 0.

It seems like an error with aer_inject: it needs to keep holding that
reference to the pci_dev until the delegated work is complete, I think?

If you comment out the pci_dev_put() at the bottom of aer_inject() in
aer_inject.c, does that UAF go away? Note: this isn't the real fix, but it
will tell me whether I'm insane or on the right path.
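
A compressed timeline of the refcount race described above (a sketch; the
call sites and counts are as described, not independently verified):

  /*
   *  aer_inject() path                    pciehp_unconfigure_device() path
   *  -----------------                    --------------------------------
   *  pci_get_domain_bus_and_slot()  +1    pci_dev_get()                 +1
   *  aer_irq() -> queues aer_isr() work
   *  pci_dev_put()                  -1    ... device removed ...
   *                                       pci_dev_put()                 -1
   *
   *  Both extra references are dropped before the queued aer_isr() work
   *  runs, so when it finally walks the bus it can touch a pci_dev that
   *  has already been torn down: use-after-free.
   */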


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 17:16                     ` Scott Bauer
@ 2018-04-06 17:46                       ` Alex G.
  2018-04-06 18:04                         ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-06 17:46 UTC (permalink / raw)




On 04/06/2018 12:16 PM, Scott Bauer wrote:
> On Fri, Apr 06, 2018@12:11:48PM -0500, Alex G. wrote:
>> On 04/06/2018 10:32 AM, Keith Busch wrote:
>>> On Thu, Apr 05, 2018@06:44:21PM -0500, Alex G. wrote:
>>>> Actually, it crashed very fast [1]
>>>
>>> Could you possibly do an experiment for this? Does this happen if you're
>>> not using MD RAID and disable NVME_MULTIPATH?
>>
>> Okay, no md-raid (or lvm mirror), NVME_MULTIPATH not set, and we still
>> get a use-after-free (log attached). Could this be a race condition
>> between nvme causing an unload and AER recovery?
> 
> 
> 
> You're using AER inject, right?

No. I'm causing the errors in hardware with hot-unplug.

> Two things with aer inject:
> 
> aer_inject.c doesn't grab pci_lock_rescan_remove() when it calls
> pci_get_domain_bus_and_slot(), so theoretically we could already be inside
> pciehp_unconfigure_device() with the rescan lock held. aer_inject will get
> the pci_dev via pci_get_domain_bus_and_slot(), which does a pci_dev_get().
> 
> pciehp_unconfigure_device() will start iterating over the devices, calling
> pci_dev_get(), so we now have 2 references on the device. aer_inject then
> calls the AER irq handler, aer_irq(), which delegates the work off to a work
> queue, so we're unblocked and kick back to aer_inject(), where at the end we
> do a pci_dev_put().
> 
> Now back in pciehp_unconfigure_device(), it does a pci_dev_put() as well and
> our refcount drops to 0.
> 
> It seems like an error with aer_inject: it needs to keep holding that
> reference to the pci_dev until the delegated work is complete, I think?
> 
> If you comment out the pci_dev_put() at the bottom of aer_inject() in
> aer_inject.c, does that UAF go away? Note: this isn't the real fix, but it
> will tell me whether I'm insane or on the right path.
> 


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 17:46                       ` Alex G.
@ 2018-04-06 18:04                         ` Keith Busch
  2018-04-06 19:00                           ` Scott Bauer
  2018-04-06 19:08                           ` Alex G.
  0 siblings, 2 replies; 24+ messages in thread
From: Keith Busch @ 2018-04-06 18:04 UTC (permalink / raw)


On Fri, Apr 06, 2018@12:46:06PM -0500, Alex G. wrote:
> On 04/06/2018 12:16 PM, Scott Bauer wrote:
> > You're using AER inject, right?
> 
> No. I'm causing the errors in hardware with hot-unplug.

I think Scott's still on the right track for this particular sighting.
The AER handler looks unsafe under changing topologies. It might need run
under pci_lock_rescan_remove() before walking the bus to prevent races
with the surprise removal, but it's not clear to me yet if holding that
lock is okay to do in this context.

This however does not appear to resemble your previous sightings. In your
previous sightings, it looks like something has lost track of commands,
and we're freeing the resources with them a second time.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 18:04                         ` Keith Busch
@ 2018-04-06 19:00                           ` Scott Bauer
  2018-04-06 19:34                             ` Keith Busch
  2018-04-06 19:08                           ` Alex G.
  1 sibling, 1 reply; 24+ messages in thread
From: Scott Bauer @ 2018-04-06 19:00 UTC (permalink / raw)




On 04/06/2018 12:04 PM, Keith Busch wrote:
> On Fri, Apr 06, 2018@12:46:06PM -0500, Alex G. wrote:
>> On 04/06/2018 12:16 PM, Scott Bauer wrote:
>>> You're using AER inject, right?
>>
>> No. I'm causing the errors in hardware with hot-unplug.
> 
> I think Scott's still on the right track for this particular sighting.
> The AER handler looks unsafe under changing topologies. It might need run
> under pci_lock_rescan_remove() before walking the bus to prevent races
> with the surprise removal, but it's not clear to me yet if holding that
> lock is okay to do in this context.

I think we may get into a deadlock situation if we grab pci_lock_rescan_remove().
The hotplug unconfigure code will eventually call driver->remove(), which I believe
can end up in aer_remove(), which will do a flush_work(). If the delegated AER
irq handler is waiting on pci_lock_rescan_remove() before it does the bus walk,
we've deadlocked: the hotplug code is waiting on remove() to finish, remove() is
waiting on the flush_work() to finish, and the work being flushed is waiting on
the lock.

Although I didn't check whether flush_work() waits for already running work or not.
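
In other words, the feared cycle looks roughly like this (a sketch of the
scenario as described; whether aer_remove() is actually reachable on this
path is the open question):

  /*
   *  removal path (holds the lock)        queued AER work (aer_isr)
   *  -----------------------------        -------------------------
   *  pci_lock_rescan_remove()
   *  ...
   *  driver->remove() -> aer_remove()
   *    flush_work() on the AER work       pci_lock_rescan_remove()
   *    -> waits for aer_isr() to finish   -> blocks: lock held by remover
   *
   *  Each side waits on the other, so neither makes progress.
   */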



> 
> This however does not appear to resemble your previous sightings. In your
> previous sightings, it looks like something has lost track of commands,
> and we're freeing the resources with them a second time.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 18:04                         ` Keith Busch
  2018-04-06 19:00                           ` Scott Bauer
@ 2018-04-06 19:08                           ` Alex G.
  2018-04-06 22:00                             ` Keith Busch
  1 sibling, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-06 19:08 UTC (permalink / raw)


On 04/06/2018 01:04 PM, Keith Busch wrote:
> On Fri, Apr 06, 2018@12:46:06PM -0500, Alex G. wrote:
>> On 04/06/2018 12:16 PM, Scott Bauer wrote:
>>> You're using AER inject, right?
>>
>> No. I'm causing the errors in hardware with hot-unplug.
> 
> I think Scott's still on the right track for this particular sighting.
> The AER handler looks unsafe under changing topologies. It might need run
> under pci_lock_rescan_remove() before walking the bus to prevent races
> with the surprise removal, but it's not clear to me yet if holding that
> lock is okay to do in this context.

I think we have three mechanisms that can remove a device: nvme timeout,
Link Down interrupt, and AER.
Link Down comes 20-60ms after the link actually dies, in which time nvme
will queue IO, which can cause a flood of PCIe errors, which trigger AER
handling. I suspect there is a massive race condition somewhere, but I
don't yet have convincing evidence to prove it.

> This however does not appear to resemble your previous sightings. In your
> previous sightings, it looks like something has lost track of commands,
> and we're freeing the resources with them a second time.

I think they might be related.

Alex


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 19:00                           ` Scott Bauer
@ 2018-04-06 19:34                             ` Keith Busch
  0 siblings, 0 replies; 24+ messages in thread
From: Keith Busch @ 2018-04-06 19:34 UTC (permalink / raw)


On Fri, Apr 06, 2018@01:00:37PM -0600, Scott Bauer wrote:
> I think we may get into a deadlock situation if we grab the pci_lock_rescan.
> the hotplug unconfigure code will eventually call driver->remove() which I believe
> can end up in the aer_remove(), which will do a flush_work. If the aer delegated
> irq handler is waiting on the pci_lock_rescan, before it does a walk_bus, we've deadlocked
> there as the hp code is waiting on the remove() to finish, and the remove is waiting on 
> the flush work to finish and the work being flushed is waiting on the lock.

Darn. I believe your point is valid, though not through pciehp since
root ports themselves are not pcie hot pluggable components, but other
paths could get there.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 19:08                           ` Alex G.
@ 2018-04-06 22:00                             ` Keith Busch
  2018-04-09 18:23                               ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-06 22:00 UTC (permalink / raw)


On Fri, Apr 06, 2018@02:08:38PM -0500, Alex G. wrote:
> On 04/06/2018 01:04 PM, Keith Busch wrote:
> > On Fri, Apr 06, 2018@12:46:06PM -0500, Alex G. wrote:
> >> On 04/06/2018 12:16 PM, Scott Bauer wrote:
> >>> You're using AER inject, right?
> >>
> >> No. I'm causing the errors in hardware with hot-unplug.
> > 
> > I think Scott's still on the right track for this particular sighting.
> > The AER handler looks unsafe under changing topologies. It might need run
> > under pci_lock_rescan_remove() before walking the bus to prevent races
> > with the surprise removal, but it's not clear to me yet if holding that
> > lock is okay to do in this context.
> 
> I think we have three mechanisms that can remove a device: nvme timeout,
> Link Down interrupt, and AER.

The only things that should cause an automatic device removal from the
PCI topology are PCIe Slot events "Link Down" or "Present Detect Change",
or a PCIe DPC trigger.

An nvme command timeout or PCIe AER may have the driver unbind from the
device if recovery attempts are unable to get the controller talking
again, but the PCI device will remain.

> Link Down comes 20-60ms after the link actually dies, in which time nvme
> will queue IO, which can cause a flood of PCIe errors, which trigger AER
> handling. I suspect there is a massive race condition somewhere, but I
> don't yet have convincing evidence to prove it.

I think Scott's identified at least one race here.
 
> > This however does not appear to resemble your previous sightings. In your
> > previous sightings, it looks like something has lost track of commands,
> > and we're freeing the resources with them a second time.
> 
> I think they might be related.

Possibly.

Fixing the AER use-after-free on a removed pci_dev looks like it's going
to require a bit more work in the AER driver to do it safely. But just
as an experiment, could you try applying the below and let us know if
that fixes this most recent sighting? The conditions where this patch
is unsafe should not apply to your test, and this would prove this type
of synchronization is required.

---
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea52e7d4..16ecbcd76373 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -805,8 +805,10 @@ void aer_isr(struct work_struct *work)
 	struct pcie_device *p_device = rpc->rpd;
 	struct aer_err_source uninitialized_var(e_src);
 
+	pci_lock_rescan_remove();
 	mutex_lock(&rpc->rpc_mutex);
 	while (get_e_source(rpc, &e_src))
 		aer_isr_one_error(p_device, &e_src);
 	mutex_unlock(&rpc->rpc_mutex);
+	pci_unlock_rescan_remove();
 }
--


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-06 22:00                             ` Keith Busch
@ 2018-04-09 18:23                               ` Alex G.
  2018-04-09 19:11                                 ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-09 18:23 UTC (permalink / raw)


On 04/06/2018 05:00 PM, Keith Busch wrote:
(snip)
> ---
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index a4bfea52e7d4..16ecbcd76373 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -805,8 +805,10 @@ void aer_isr(struct work_struct *work)
>  	struct pcie_device *p_device = rpc->rpd;
>  	struct aer_err_source uninitialized_var(e_src);
>  
> +	pci_lock_rescan_remove();
>  	mutex_lock(&rpc->rpc_mutex);
>  	while (get_e_source(rpc, &e_src))
>  		aer_isr_one_error(p_device, &e_src);
>  	mutex_unlock(&rpc->rpc_mutex);
> +	pci_unlock_rescan_remove();
>  }
> --

With this patch, I'm not seeing issues without LVM mirrors, but as soon
as I enable the mirror, we get the
 * use-after-free in swiotlb_unmap_sg_attrs
 * double-free or invalid-free in nvme_pci_complete_rq

Alex


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 18:23                               ` Alex G.
@ 2018-04-09 19:11                                 ` Keith Busch
  2018-04-09 19:36                                   ` Alex G.
  0 siblings, 1 reply; 24+ messages in thread
From: Keith Busch @ 2018-04-09 19:11 UTC (permalink / raw)


On Mon, Apr 09, 2018@01:23:54PM -0500, Alex G. wrote:
> On 04/06/2018 05:00 PM, Keith Busch wrote:
> (snip)
> > ---
> > diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> > index a4bfea52e7d4..16ecbcd76373 100644
> > --- a/drivers/pci/pcie/aer/aerdrv_core.c
> > +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> > @@ -805,8 +805,10 @@ void aer_isr(struct work_struct *work)
> >  	struct pcie_device *p_device = rpc->rpd;
> >  	struct aer_err_source uninitialized_var(e_src);
> >  
> > +	pci_lock_rescan_remove();
> >  	mutex_lock(&rpc->rpc_mutex);
> >  	while (get_e_source(rpc, &e_src))
> >  		aer_isr_one_error(p_device, &e_src);
> >  	mutex_unlock(&rpc->rpc_mutex);
> > +	pci_unlock_rescan_remove();
> >  }
> > --
> 
> With this patch, I'm not seeing issues without LVM mirrors, but as soon
> as I enable the mirror, we get the
>  * use-after-free in swiotlb_unmap_sg_attrs
>  * double-free or invalid-free in nvme_pci_complete_rq
> 
> Alex

Awesome, thank you for the update. We'll do some more work on the above
fix for the use-after-free in AER handling to make it ready for
consideration.

Still not sure on the nvme double/incorrect completion, but have not
given up yet.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 19:11                                 ` Keith Busch
@ 2018-04-09 19:36                                   ` Alex G.
  2018-04-09 19:47                                     ` Keith Busch
  0 siblings, 1 reply; 24+ messages in thread
From: Alex G. @ 2018-04-09 19:36 UTC (permalink / raw)




On 04/09/2018 02:11 PM, Keith Busch wrote:
> On Mon, Apr 09, 2018@01:23:54PM -0500, Alex G. wrote:
>> On 04/06/2018 05:00 PM, Keith Busch wrote:
>> (snip)
>>> ---
>>> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
>>> index a4bfea52e7d4..16ecbcd76373 100644
>>> --- a/drivers/pci/pcie/aer/aerdrv_core.c
>>> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
>>> @@ -805,8 +805,10 @@ void aer_isr(struct work_struct *work)
>>>  	struct pcie_device *p_device = rpc->rpd;
>>>  	struct aer_err_source uninitialized_var(e_src);
>>>  
>>> +	pci_lock_rescan_remove();
>>>  	mutex_lock(&rpc->rpc_mutex);
>>>  	while (get_e_source(rpc, &e_src))
>>>  		aer_isr_one_error(p_device, &e_src);
>>>  	mutex_unlock(&rpc->rpc_mutex);
>>> +	pci_unlock_rescan_remove();
>>>  }
>>> --
>>
>> With this patch, I'm not seeing issues without LVM mirrors, but as soon
>> as I enable the mirror, we get the
>>  * use-after-free in swiotlb_unmap_sg_attrs
>>  * double-free or invalid-free in nvme_pci_complete_rq
>>
>> Alex
> 
> Awesome, thank you for the update. We'll do some more work on the
> above fixing the use-after-free in AER handling to make it ready for
> consideration.

No problem.

> Still not sure on the nvme double/incorrect completion, but have not
> given up yet.

Although I remember a time when I could run these sort of tests without
panic(), I don't have a solid base from which I can bisect things.
Especially since we've added a few fixes lately that won't be in any base.

I'll try a couple of branches, but don't hold your breath.

Alex


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 19:36                                   ` Alex G.
@ 2018-04-09 19:47                                     ` Keith Busch
  2018-04-10  0:07                                       ` Alex G.
                                                         ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Keith Busch @ 2018-04-09 19:47 UTC (permalink / raw)


On Mon, Apr 09, 2018@02:36:30PM -0500, Alex G. wrote:
> Although I remember a time when I could run these sort of tests without
> panic(), I don't have a solid base from which I can bisect things.
> Especially since we've added a few fixes lately that won't be in any base.
> 
> I'll try a couple of branches, but don't hold your breath.

This may be a regression. Your panic is triggered through the timeout
path, which appears to have had recent modifications that are included
in your kernel. There's at least one issue in this area:

  https://marc.info/?l=linux-block&m=152325124306821&w=2

I haven't looked into it enough to know if it's related to this issue,
though.


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 19:47                                     ` Keith Busch
@ 2018-04-10  0:07                                       ` Alex G.
  2018-04-10 14:19                                       ` Alex G.
  2018-05-02 15:39                                       ` Alex G.
  2 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-04-10  0:07 UTC (permalink / raw)


On 04/09/2018 02:47 PM, Keith Busch wrote:
> On Mon, Apr 09, 2018@02:36:30PM -0500, Alex G. wrote:
>> Although I remember a time when I could run these sort of tests without
>> panic(), I don't have a solid base from which I can bisect things.
>> Especially since we've added a few fixes lately that won't be in any base.
>>
>> I'll try a couple of branches, but don't hold your breath.
> 
> This may be a regression.

It is. 4.16-rc7+ (plus the nsid duplicate and device free patches) does not
crash or cause KASAN to complain. However, axboe/linux-block/for-next
exhibits the issue. I'm getting ready to also test for-4.17/block, and
then your AER patches.

> Your panic is triggered through the timeout
> path, which appears to have had recent modifications that are included
> in your kernel. There's at least one issue in this area:
> 
>   https://marc.info/?l=linux-block&m=152325124306821&w=2
> 
> I haven't looked into it enough to know if it's related to this issue,
> though.

Thanks. I'll grab a look tomorrow and see if that fixes the issue.

Alex


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 19:47                                     ` Keith Busch
  2018-04-10  0:07                                       ` Alex G.
@ 2018-04-10 14:19                                       ` Alex G.
  2018-05-02 15:39                                       ` Alex G.
  2 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-04-10 14:19 UTC (permalink / raw)



On 04/09/2018 02:47 PM, Keith Busch wrote:
> On Mon, Apr 09, 2018@02:36:30PM -0500, Alex G. wrote:
>> Although I remember a time when I could run these sort of tests without
>> panic(), I don't have a solid base from which I can bisect things.
>> Especially since we've added a few fixes lately that won't be in any base.
>>
>> I'll try a couple of branches, but don't hold your breath.
> 
> This may be a regression. Your panic is triggered through the timeout
> path, which appears to have had recent modifications that are included
> in your kernel. There's at least one issue in this area:
> 
>   https://marc.info/?l=linux-block&m=152325124306821&w=2

This change appears to fix the NULL dereference.

> I haven't looked into it enough to know if it's related to this issue,
> though.
> 


* IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again
  2018-04-09 19:47                                     ` Keith Busch
  2018-04-10  0:07                                       ` Alex G.
  2018-04-10 14:19                                       ` Alex G.
@ 2018-05-02 15:39                                       ` Alex G.
  2 siblings, 0 replies; 24+ messages in thread
From: Alex G. @ 2018-05-02 15:39 UTC (permalink / raw)


On 04/09/2018 02:47 PM, Keith Busch wrote:
> On Mon, Apr 09, 2018@02:36:30PM -0500, Alex G. wrote:
>> Although I remember a time when I could run these sort of tests without
>> panic(), I don't have a solid base from which I can bisect things.
>> Especially since we've added a few fixes lately that won't be in any base.
>>
>> I'll try a couple of branches, but don't hold your breath.
> 
> This may be a regression. Your panic is triggered through the timeout
> path, which appears to have had recent modifications that are included
> in your kernel. There's at least one issue in this area:
> 
>   https://marc.info/?l=linux-block&m=152325124306821&w=2
> 
> I haven't looked into it enough to know if it's related to this issue,
> though.

I just tested 4.17-rc3, and without this patch, I still see the NULL
dereference. I don't know whom to nudge, but it would be nice for this
fix to land upstream.

Alex


Thread overview: 24+ messages
     [not found] <5d6d1a8c-6490-4046-0fba-da0a0df3d00c@gmail.com>
2018-04-05 21:38 ` IRQ/nvme_pci_complete_rq: NULL pointer dereference yet again Keith Busch
2018-04-05 21:22   ` Scott Bauer
2018-04-05 22:21     ` Alex G.
2018-04-05 22:41       ` Keith Busch
2018-04-05 22:48         ` Keith Busch
2018-04-05 23:05           ` Keith Busch
2018-04-05 23:39             ` Alex G.
2018-04-05 23:44               ` Alex G.
2018-04-06 15:32                 ` Keith Busch
2018-04-06 15:46                   ` Alex G.
     [not found]                   ` <94d77cb7-759f-595a-2264-37305dfa96c4@gmail.com>
2018-04-06 17:16                     ` Scott Bauer
2018-04-06 17:46                       ` Alex G.
2018-04-06 18:04                         ` Keith Busch
2018-04-06 19:00                           ` Scott Bauer
2018-04-06 19:34                             ` Keith Busch
2018-04-06 19:08                           ` Alex G.
2018-04-06 22:00                             ` Keith Busch
2018-04-09 18:23                               ` Alex G.
2018-04-09 19:11                                 ` Keith Busch
2018-04-09 19:36                                   ` Alex G.
2018-04-09 19:47                                     ` Keith Busch
2018-04-10  0:07                                       ` Alex G.
2018-04-10 14:19                                       ` Alex G.
2018-05-02 15:39                                       ` Alex G.
