All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] PCI/AER: Use-after-free fix
@ 2018-04-09 22:04 Keith Busch
  2018-04-09 22:04 ` [PATCH 1/4] PCI/AER: Remove unused parameters Keith Busch
                   ` (6 more replies)
  0 siblings, 7 replies; 15+ messages in thread
From: Keith Busch @ 2018-04-09 22:04 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer, Keith Busch

AER error handling walks the PCI topology below a root port, saving
pointers of the pci_dev structs affected by the error along the way.

At the same time, the pcie hotplug driver could be freeing those very
same structures, causing the AER driver to reference freed memory.

This series fixes this by synchroniziing the aer driver with the pci
hotplug driver during. The final patch is the one that ultimately provides
the locking by having AER lock the same pci lock rescan/remove mutex as
the pciehp driver. The first three patches are prepping to make it safe
for the aer bottom half handler to hold that lock.

Keith Busch (4):
  PCI/AER: Remove unused parameters
  PCI/AER: Replace struct pcie_device with pci_dev
  PCI/AER: Reference count aer structures
  PCI/AER: Lock pci topology when scanning errors

 drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
 drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
 drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
 3 files changed, 41 insertions(+), 34 deletions(-)

-- 
2.14.3

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] PCI/AER: Remove unused parameters
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
@ 2018-04-09 22:04 ` Keith Busch
  2018-04-09 22:04 ` [PATCH 2/4] PCI/AER: Replace struct pcie_device with pci_dev Keith Busch
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2018-04-09 22:04 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer, Keith Busch

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv_core.c | 14 +++++---------
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 0ea5acc40323..a533ea25e52c 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -541,15 +541,12 @@ static void do_recovery(struct pci_dev *dev, int severity)
 
 /**
  * handle_error_source - handle logging error into an event log
- * @aerdev: pointer to pcie_device data structure of the root port
  * @dev: pointer to pci_dev data structure of error source device
  * @info: comprehensive error information
  *
  * Invoked when an error being detected by Root Port.
  */
-static void handle_error_source(struct pcie_device *aerdev,
-	struct pci_dev *dev,
-	struct aer_err_info *info)
+static void handle_error_source(struct pci_dev *dev, struct aer_err_info *info)
 {
 	int pos;
 
@@ -695,8 +692,7 @@ static int get_device_error_info(struct pci_dev *dev, struct aer_err_info *info)
 	return 1;
 }
 
-static inline void aer_process_err_devices(struct pcie_device *p_device,
-			struct aer_err_info *e_info)
+static inline void aer_process_err_devices(struct aer_err_info *e_info)
 {
 	int i;
 
@@ -707,7 +703,7 @@ static inline void aer_process_err_devices(struct pcie_device *p_device,
 	}
 	for (i = 0; i < e_info->error_dev_num && e_info->dev[i]; i++) {
 		if (get_device_error_info(e_info->dev[i], e_info))
-			handle_error_source(p_device, e_info->dev[i], e_info);
+			handle_error_source(e_info->dev[i], e_info);
 	}
 }
 
@@ -738,7 +734,7 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 		aer_print_port_info(p_device->port, e_info);
 
 		if (find_source_device(p_device->port, e_info))
-			aer_process_err_devices(p_device, e_info);
+			aer_process_err_devices(e_info);
 	}
 
 	if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -757,7 +753,7 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 		aer_print_port_info(p_device->port, e_info);
 
 		if (find_source_device(p_device->port, e_info))
-			aer_process_err_devices(p_device, e_info);
+			aer_process_err_devices(e_info);
 	}
 }
 
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/4] PCI/AER: Replace struct pcie_device with pci_dev
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
  2018-04-09 22:04 ` [PATCH 1/4] PCI/AER: Remove unused parameters Keith Busch
@ 2018-04-09 22:04 ` Keith Busch
  2018-04-09 22:04 ` [PATCH 3/4] PCI/AER: Reference count aer structures Keith Busch
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2018-04-09 22:04 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer, Keith Busch

The AER driver really only needed the pcie_device to get to the port
pci_dev, so this patch removes the unnecessary indirection.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv.c      |  6 +++---
 drivers/pci/pcie/aer/aerdrv.h      |  2 +-
 drivers/pci/pcie/aer/aerdrv_core.c | 18 ++++++++----------
 3 files changed, 12 insertions(+), 14 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 779b3879b1b5..9ce8a824afbc 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -94,7 +94,7 @@ static void set_downstream_devices_error_reporting(struct pci_dev *dev,
  */
 static void aer_enable_rootport(struct aer_rpc *rpc)
 {
-	struct pci_dev *pdev = rpc->rpd->port;
+	struct pci_dev *pdev = rpc->rpd;
 	int aer_pos;
 	u16 reg16;
 	u32 reg32;
@@ -136,7 +136,7 @@ static void aer_enable_rootport(struct aer_rpc *rpc)
  */
 static void aer_disable_rootport(struct aer_rpc *rpc)
 {
-	struct pci_dev *pdev = rpc->rpd->port;
+	struct pci_dev *pdev = rpc->rpd;
 	u32 reg32;
 	int pos;
 
@@ -232,7 +232,7 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev)
 	/* Initialize Root lock access, e_lock, to Root Error Status Reg */
 	spin_lock_init(&rpc->e_lock);
 
-	rpc->rpd = dev;
+	rpc->rpd = dev->port;
 	INIT_WORK(&rpc->dpc_handler, aer_isr);
 	mutex_init(&rpc->rpc_mutex);
 
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index 08b4584f62fe..f34174feab55 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -58,7 +58,7 @@ struct aer_err_source {
 };
 
 struct aer_rpc {
-	struct pcie_device *rpd;	/* Root Port device */
+	struct pci_dev *rpd;		/* Root Port device */
 	struct work_struct dpc_handler;
 	struct aer_err_source e_sources[AER_ERROR_SOURCES_MAX];
 	struct aer_err_info e_info;
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a533ea25e52c..672374cfb16d 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -709,13 +709,13 @@ static inline void aer_process_err_devices(struct aer_err_info *e_info)
 
 /**
  * aer_isr_one_error - consume an error detected by root port
- * @p_device: pointer to error root port service device
+ * @rpc: pointer to the root port which holds an error
  * @e_src: pointer to an error source
  */
-static void aer_isr_one_error(struct pcie_device *p_device,
+static void aer_isr_one_error(struct aer_rpc *rpc,
 		struct aer_err_source *e_src)
 {
-	struct aer_rpc *rpc = get_service_data(p_device);
+	struct pci_dev *pdev = rpc->rpd;
 	struct aer_err_info *e_info = &rpc->e_info;
 
 	/*
@@ -730,10 +730,9 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 			e_info->multi_error_valid = 1;
 		else
 			e_info->multi_error_valid = 0;
+		aer_print_port_info(pdev, e_info);
 
-		aer_print_port_info(p_device->port, e_info);
-
-		if (find_source_device(p_device->port, e_info))
+		if (find_source_device(pdev, e_info))
 			aer_process_err_devices(e_info);
 	}
 
@@ -750,9 +749,9 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 		else
 			e_info->multi_error_valid = 0;
 
-		aer_print_port_info(p_device->port, e_info);
+		aer_print_port_info(pdev, e_info);
 
-		if (find_source_device(p_device->port, e_info))
+		if (find_source_device(pdev, e_info))
 			aer_process_err_devices(e_info);
 	}
 }
@@ -795,11 +794,10 @@ static int get_e_source(struct aer_rpc *rpc, struct aer_err_source *e_src)
 void aer_isr(struct work_struct *work)
 {
 	struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler);
-	struct pcie_device *p_device = rpc->rpd;
 	struct aer_err_source uninitialized_var(e_src);
 
 	mutex_lock(&rpc->rpc_mutex);
 	while (get_e_source(rpc, &e_src))
-		aer_isr_one_error(p_device, &e_src);
+		aer_isr_one_error(rpc, &e_src);
 	mutex_unlock(&rpc->rpc_mutex);
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/4] PCI/AER: Reference count aer structures
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
  2018-04-09 22:04 ` [PATCH 1/4] PCI/AER: Remove unused parameters Keith Busch
  2018-04-09 22:04 ` [PATCH 2/4] PCI/AER: Replace struct pcie_device with pci_dev Keith Busch
@ 2018-04-09 22:04 ` Keith Busch
  2018-04-09 22:04 ` [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors Keith Busch
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 15+ messages in thread
From: Keith Busch @ 2018-04-09 22:04 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer, Keith Busch

The AER driver's removal was flushing its scheduled work to ensure it
was safe to free the aer structure. This patch removes that flushing and
prevents use-after-free instead by reference counting the aer root port
structure and its pci_dev.

The purpose of this patch is to allow the bottom half worker to take
locks that may be held while the aer driver's removal is called.

Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv.c      | 23 +++++++++++++++++++----
 drivers/pci/pcie/aer/aerdrv.h      |  2 ++
 drivers/pci/pcie/aer/aerdrv_core.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 9ce8a824afbc..0b2eb88c422b 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -209,7 +209,9 @@ irqreturn_t aer_irq(int irq, void *context)
 	spin_unlock_irqrestore(&rpc->e_lock, flags);
 
 	/*  Invoke DPC handler */
-	schedule_work(&rpc->dpc_handler);
+	kref_get(&rpc->ref);
+	if (!schedule_work(&rpc->dpc_handler))
+		aer_release(rpc);
 
 	return IRQ_HANDLED;
 }
@@ -232,7 +234,8 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev)
 	/* Initialize Root lock access, e_lock, to Root Error Status Reg */
 	spin_lock_init(&rpc->e_lock);
 
-	rpc->rpd = dev->port;
+	rpc->rpd = pci_dev_get(dev->port);
+	kref_init(&rpc->ref);
 	INIT_WORK(&rpc->dpc_handler, aer_isr);
 	mutex_init(&rpc->rpc_mutex);
 
@@ -242,6 +245,19 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev)
 	return rpc;
 }
 
+static void aer_free(struct kref *ref)
+{
+	struct aer_rpc *rpc = container_of(ref, struct aer_rpc, ref);
+
+	pci_dev_put(rpc->rpd);
+	kfree(rpc);
+}
+
+void aer_release(struct aer_rpc *rpc)
+{
+	kref_put(&rpc->ref, aer_free);
+}
+
 /**
  * aer_remove - clean up resources
  * @dev: pointer to the pcie_dev data structure
@@ -257,10 +273,9 @@ static void aer_remove(struct pcie_device *dev)
 		if (rpc->isr)
 			free_irq(dev->irq, dev);
 
-		flush_work(&rpc->dpc_handler);
 		aer_disable_rootport(rpc);
-		kfree(rpc);
 		set_service_data(dev, NULL);
+		aer_release(rpc);
 	}
 }
 
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index f34174feab55..f886521e2c7b 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -60,6 +60,7 @@ struct aer_err_source {
 struct aer_rpc {
 	struct pci_dev *rpd;		/* Root Port device */
 	struct work_struct dpc_handler;
+	struct kref ref;
 	struct aer_err_source e_sources[AER_ERROR_SOURCES_MAX];
 	struct aer_err_info e_info;
 	unsigned short prod_idx;	/* Error Producer Index */
@@ -110,6 +111,7 @@ extern struct bus_type pcie_port_bus_type;
 void aer_isr(struct work_struct *work);
 void aer_print_error(struct pci_dev *dev, struct aer_err_info *info);
 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info);
+void aer_release(struct aer_rpc *rpc);
 irqreturn_t aer_irq(int irq, void *context);
 
 #ifdef CONFIG_ACPI_APEI
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index 672374cfb16d..e4059d7fa7fa 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -800,4 +800,6 @@ void aer_isr(struct work_struct *work)
 	while (get_e_source(rpc, &e_src))
 		aer_isr_one_error(rpc, &e_src);
 	mutex_unlock(&rpc->rpc_mutex);
+
+	aer_release(rpc);
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
                   ` (2 preceding siblings ...)
  2018-04-09 22:04 ` [PATCH 3/4] PCI/AER: Reference count aer structures Keith Busch
@ 2018-04-09 22:04 ` Keith Busch
  2018-06-05 22:09   ` Bjorn Helgaas
  2018-04-10 13:15 ` [PATCH 0/4] PCI/AER: Use-after-free fix Dongdong Liu
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2018-04-09 22:04 UTC (permalink / raw)
  To: Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer, Keith Busch

The side effects of surprise removal may trigger AER handling. The AER
handling walks the pci topology and may access a pci_dev that is being
freed by the hotplug handler.

This patch fixes that use-after-free by locking the PCI topology in
the AER handler so it isn't racing with the pciehp removal.

Since the AER handler now runs under a global PCI lock, the rpc specific
mutex is no longer necessary.

Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
Signed-off-by: Keith Busch <keith.busch@intel.com>
---
 drivers/pci/pcie/aer/aerdrv.c      | 1 -
 drivers/pci/pcie/aer/aerdrv.h      | 5 -----
 drivers/pci/pcie/aer/aerdrv_core.c | 4 ++--
 3 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
index 0b2eb88c422b..b88e5e2f3700 100644
--- a/drivers/pci/pcie/aer/aerdrv.c
+++ b/drivers/pci/pcie/aer/aerdrv.c
@@ -237,7 +237,6 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev)
 	rpc->rpd = pci_dev_get(dev->port);
 	kref_init(&rpc->ref);
 	INIT_WORK(&rpc->dpc_handler, aer_isr);
-	mutex_init(&rpc->rpc_mutex);
 
 	/* Use PCIe bus function to store rpc into PCIe device */
 	set_service_data(dev, rpc);
diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
index f886521e2c7b..b90fc5d4cda2 100644
--- a/drivers/pci/pcie/aer/aerdrv.h
+++ b/drivers/pci/pcie/aer/aerdrv.h
@@ -70,11 +70,6 @@ struct aer_rpc {
 					 * Lock access to Error Status/ID Regs
 					 * and error producer/consumer index
 					 */
-	struct mutex rpc_mutex;		/*
-					 * only one thread could do
-					 * recovery on the same
-					 * root port hierarchy
-					 */
 };
 
 struct aer_broadcast_data {
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index e4059d7fa7fa..de210b7439eb 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work)
 	struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler);
 	struct aer_err_source uninitialized_var(e_src);
 
-	mutex_lock(&rpc->rpc_mutex);
+	pci_lock_rescan_remove();
 	while (get_e_source(rpc, &e_src))
 		aer_isr_one_error(rpc, &e_src);
-	mutex_unlock(&rpc->rpc_mutex);
+	pci_unlock_rescan_remove();
 
 	aer_release(rpc);
 }
-- 
2.14.3

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
                   ` (3 preceding siblings ...)
  2018-04-09 22:04 ` [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors Keith Busch
@ 2018-04-10 13:15 ` Dongdong Liu
  2018-04-12 17:06 ` Alex_Gagniuc
  2018-06-05 22:11 ` Bjorn Helgaas
  6 siblings, 0 replies; 15+ messages in thread
From: Dongdong Liu @ 2018-04-10 13:15 UTC (permalink / raw)
  To: Keith Busch, Linux PCI, Bjorn Helgaas; +Cc: Alex_Gagniuc, Scott Bauer


在 2018/4/10 6:04, Keith Busch 写道:
> AER error handling walks the PCI topology below a root port, saving
> pointers of the pci_dev structs affected by the error along the way.
>
> At the same time, the pcie hotplug driver could be freeing those very
> same structures, causing the AER driver to reference freed memory.

I also have met such issue.  The details see the below link.
https://www.spinics.net/lists/linux-pci/msg70614.html
It seems do reset_link() will trigger hotplug driver to remove/rescan device
when met AER fatal error.
do_recovery()
    --- reset_link()
	--- pci_reset_secondary_bus() //then trigger link down/up.
So if the root port support hotplug, it will remove/rescan device.
I still have a question, since AER driver have already done recovery,
it seems no need to trigger hotplug to remove and rescan the device.

Thanks,
Dongdong

>
> This series fixes this by synchroniziing the aer driver with the pci
> hotplug driver during. The final patch is the one that ultimately provides
> the locking by having AER lock the same pci lock rescan/remove mutex as
> the pciehp driver. The first three patches are prepping to make it safe
> for the aer bottom half handler to hold that lock.
>
> Keith Busch (4):
>   PCI/AER: Remove unused parameters
>   PCI/AER: Replace struct pcie_device with pci_dev
>   PCI/AER: Reference count aer structures
>   PCI/AER: Lock pci topology when scanning errors
>
>  drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
>  drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
>  drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
>  3 files changed, 41 insertions(+), 34 deletions(-)
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-12 17:06 ` Alex_Gagniuc
@ 2018-04-12 16:47   ` Scott Bauer
  2018-04-13 14:49     ` Alex_Gagniuc
  2018-04-16 19:49     ` Alex_Gagniuc
  2018-04-12 17:10   ` Keith Busch
  1 sibling, 2 replies; 15+ messages in thread
From: Scott Bauer @ 2018-04-12 16:47 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: keith.busch, linux-pci, bhelgaas

On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com] 
> 
> > AER error handling walks the PCI topology below a root port, saving pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change eliminates the use-after-free issue we've seen. The race seems to be quite elusive, so I can't reliably reproduce it. Your changes have not been forgotten; I have them staged for further testing.
> 
> Alex


If you need help triggering the race you can add a sleep/microsleep here:

aer_isr_one_error() between the find_source_device and process err device:

sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff drivers/pci/pcie/aer/aerdrv_core.c
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
index a4bfea52e7d4..5ca0c07b1d05 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -22,6 +22,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/kfifo.h>
+#include <linux/delay.h>
 #include "aerdrv.h"
 
 #define        PCI_EXP_AER_FLAGS       (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVCTL_NFERE | \
@@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 
        if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) {
@@ -759,8 +762,10 @@ static void aer_isr_one_error(struct pcie_device *p_device,
 
                aer_print_port_info(p_device->port, e_info);
 
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 }

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
                   ` (4 preceding siblings ...)
  2018-04-10 13:15 ` [PATCH 0/4] PCI/AER: Use-after-free fix Dongdong Liu
@ 2018-04-12 17:06 ` Alex_Gagniuc
  2018-04-12 16:47   ` Scott Bauer
  2018-04-12 17:10   ` Keith Busch
  2018-06-05 22:11 ` Bjorn Helgaas
  6 siblings, 2 replies; 15+ messages in thread
From: Alex_Gagniuc @ 2018-04-12 17:06 UTC (permalink / raw)
  To: keith.busch, linux-pci, bhelgaas; +Cc: scott.bauer

From: Keith Busch [mailto:keith.busch@intel.com]=20

> AER error handling walks the PCI topology below a root port, saving point=
ers of the pci_dev structs affected by the error along the way.

Hi Keith,

I've been trying to do an ABA test to confirm that your change eliminates t=
he use-after-free issue we've seen. The race seems to be quite elusive, so =
I can't reliably reproduce it. Your changes have not been forgotten; I have=
 them staged for further testing.

Alex

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-12 17:06 ` Alex_Gagniuc
  2018-04-12 16:47   ` Scott Bauer
@ 2018-04-12 17:10   ` Keith Busch
  1 sibling, 0 replies; 15+ messages in thread
From: Keith Busch @ 2018-04-12 17:10 UTC (permalink / raw)
  To: Alex_Gagniuc; +Cc: linux-pci, bhelgaas, scott.bauer

On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com] 
> 
> > AER error handling walks the PCI topology below a root port, saving
> > pointers of the pci_dev structs affected by the error along the way.
> 
> Hi Keith,
> 
> I've been trying to do an ABA test to confirm that your change
> eliminates the use-after-free issue we've seen. The race seems to be
> quite elusive, so I can't reliably reproduce it. Your changes have not
> been forgotten; I have them staged for further testing.
> 
> Alex

No worries. This is essentially just a "safe" version of the patch that
you had tested, so if that one was successful, this one should be too.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* RE: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-12 16:47   ` Scott Bauer
@ 2018-04-13 14:49     ` Alex_Gagniuc
  2018-04-16 19:49     ` Alex_Gagniuc
  1 sibling, 0 replies; 15+ messages in thread
From: Alex_Gagniuc @ 2018-04-13 14:49 UTC (permalink / raw)
  To: scott.bauer; +Cc: keith.busch, linux-pci, bhelgaas, mr.nuke.me

I got the cold chills when I realized you called for a delay of 350ms. It's=
 because 350ms is around the delay I've observed to be caused by FFS.
First run KASANed with the extra delay, so hopefully, I'll have more cement=
 test results by EOB today.

Alex

-----Original Message-----
From: Scott Bauer [mailto:scott.bauer@intel.com]=20
Sent: Thursday, April 12, 2018 11:47 AM
To: Gagniuc, Alexandru - Dell Team
Cc: keith.busch@intel.com; linux-pci@vger.kernel.org; bhelgaas@google.com
Subject: Re: [PATCH 0/4] PCI/AER: Use-after-free fix

On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com]
>=20
> > AER error handling walks the PCI topology below a root port, saving poi=
nters of the pci_dev structs affected by the error along the way.
>=20
> Hi Keith,
>=20
> I've been trying to do an ABA test to confirm that your change eliminates=
 the use-after-free issue we've seen. The race seems to be quite elusive, s=
o I can't reliably reproduce it. Your changes have not been forgotten; I ha=
ve them staged for further testing.
>=20
> Alex


If you need help triggering the race you can add a sleep/microsleep here:

aer_isr_one_error() between the find_source_device and process err device:

sbauer@sbauer-Z170X-UD5:~/nvme_code/upstream_jens/linux-block$ git diff dri=
vers/pci/pcie/aer/aerdrv_core.c
diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerd=
rv_core.c
index a4bfea52e7d4..5ca0c07b1d05 100644
--- a/drivers/pci/pcie/aer/aerdrv_core.c
+++ b/drivers/pci/pcie/aer/aerdrv_core.c
@@ -22,6 +22,7 @@
 #include <linux/delay.h>
 #include <linux/slab.h>
 #include <linux/kfifo.h>
+#include <linux/delay.h>
 #include "aerdrv.h"
=20
 #define        PCI_EXP_AER_FLAGS       (PCI_EXP_DEVCTL_CERE | PCI_EXP_DEVC=
TL_NFERE | \
@@ -740,8 +741,10 @@ static void aer_isr_one_error(struct pcie_device *p_de=
vice,
=20
                aer_print_port_info(p_device->port, e_info);
=20
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
=20
        if (e_src->status & PCI_ERR_ROOT_UNCOR_RCV) { @@ -759,8 +762,10 @@ =
static void aer_isr_one_error(struct pcie_device *p_device,
=20
                aer_print_port_info(p_device->port, e_info);
=20
-               if (find_source_device(p_device->port, e_info))
+               if (find_source_device(p_device->port, e_info)) {
+                       msleep(350);
                        aer_process_err_devices(p_device, e_info);
+               }
        }
 }

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* RE: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-12 16:47   ` Scott Bauer
  2018-04-13 14:49     ` Alex_Gagniuc
@ 2018-04-16 19:49     ` Alex_Gagniuc
  1 sibling, 0 replies; 15+ messages in thread
From: Alex_Gagniuc @ 2018-04-16 19:49 UTC (permalink / raw)
  To: scott.bauer; +Cc: keith.busch, linux-pci, bhelgaas, mr.nuke.me

From: Scott Bauer [mailto:scott.bauer@intel.com]=20
On Thu, Apr 12, 2018 at 05:06:05PM +0000, Alex_Gagniuc@Dellteam.com wrote:
> From: Keith Busch [mailto:keith.busch@intel.com]
>=20
> > AER error handling walks the PCI topology below a root port, saving poi=
nters of the pci_dev structs affected by the error along the way.
>=20
> Hi Keith,
>=20
> I've been trying to do an ABA test to confirm that your change eliminates=
 the use-after-free issue we've seen. The race seems to be quite elusive, s=
o I can't reliably reproduce it. Your changes have not been forgotten; I ha=
ve them staged for further testing.
>=20
> Alex


> If you need help triggering the race you can add a sleep/microsleep here:

Tested-by: Alexandru Gagniuc

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors
  2018-04-09 22:04 ` [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors Keith Busch
@ 2018-06-05 22:09   ` Bjorn Helgaas
  2018-06-05 22:18     ` Keith Busch
  0 siblings, 1 reply; 15+ messages in thread
From: Bjorn Helgaas @ 2018-06-05 22:09 UTC (permalink / raw)
  To: Keith Busch; +Cc: Linux PCI, Bjorn Helgaas, Alex_Gagniuc, Scott Bauer

On Mon, Apr 09, 2018 at 04:04:44PM -0600, Keith Busch wrote:
> The side effects of surprise removal may trigger AER handling. The AER
> handling walks the pci topology and may access a pci_dev that is being
> freed by the hotplug handler.
> 
> This patch fixes that use-after-free by locking the PCI topology in
> the AER handler so it isn't racing with the pciehp removal.
> 
> Since the AER handler now runs under a global PCI lock, the rpc specific
> mutex is no longer necessary.
> 
> Reported-by: Alex Gagniuc <Alex_Gagniuc@Dellteam.com>
> Signed-off-by: Keith Busch <keith.busch@intel.com>
> ---
>  drivers/pci/pcie/aer/aerdrv.c      | 1 -
>  drivers/pci/pcie/aer/aerdrv.h      | 5 -----
>  drivers/pci/pcie/aer/aerdrv_core.c | 4 ++--
>  3 files changed, 2 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/pci/pcie/aer/aerdrv.c b/drivers/pci/pcie/aer/aerdrv.c
> index 0b2eb88c422b..b88e5e2f3700 100644
> --- a/drivers/pci/pcie/aer/aerdrv.c
> +++ b/drivers/pci/pcie/aer/aerdrv.c
> @@ -237,7 +237,6 @@ static struct aer_rpc *aer_alloc_rpc(struct pcie_device *dev)
>  	rpc->rpd = pci_dev_get(dev->port);
>  	kref_init(&rpc->ref);
>  	INIT_WORK(&rpc->dpc_handler, aer_isr);
> -	mutex_init(&rpc->rpc_mutex);
>  
>  	/* Use PCIe bus function to store rpc into PCIe device */
>  	set_service_data(dev, rpc);
> diff --git a/drivers/pci/pcie/aer/aerdrv.h b/drivers/pci/pcie/aer/aerdrv.h
> index f886521e2c7b..b90fc5d4cda2 100644
> --- a/drivers/pci/pcie/aer/aerdrv.h
> +++ b/drivers/pci/pcie/aer/aerdrv.h
> @@ -70,11 +70,6 @@ struct aer_rpc {
>  					 * Lock access to Error Status/ID Regs
>  					 * and error producer/consumer index
>  					 */
> -	struct mutex rpc_mutex;		/*
> -					 * only one thread could do
> -					 * recovery on the same
> -					 * root port hierarchy
> -					 */
>  };
>  
>  struct aer_broadcast_data {
> diff --git a/drivers/pci/pcie/aer/aerdrv_core.c b/drivers/pci/pcie/aer/aerdrv_core.c
> index e4059d7fa7fa..de210b7439eb 100644
> --- a/drivers/pci/pcie/aer/aerdrv_core.c
> +++ b/drivers/pci/pcie/aer/aerdrv_core.c
> @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work)
>  	struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler);
>  	struct aer_err_source uninitialized_var(e_src);
>  
> -	mutex_lock(&rpc->rpc_mutex);
> +	pci_lock_rescan_remove();
>  	while (get_e_source(rpc, &e_src))
>  		aer_isr_one_error(rpc, &e_src);
> -	mutex_unlock(&rpc->rpc_mutex);
> +	pci_unlock_rescan_remove();

I think this needs to be updated after Oza's patches, doesn't it?

It looks like this would deadlock if I applied it to my current "next"
branch as-is:

  aer_isr
    pci_lock_rescan_remove
    aer_isr_one_error
      aer_process_err_devices
        handle_error_source
          pcie_do_fatal_recovery
            pci_lock_rescan_remove      <-- deadlock

>       aer_release(rpc);
>  }
> -- 
> 2.14.3
> 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 0/4] PCI/AER: Use-after-free fix
  2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
                   ` (5 preceding siblings ...)
  2018-04-12 17:06 ` Alex_Gagniuc
@ 2018-06-05 22:11 ` Bjorn Helgaas
  6 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2018-06-05 22:11 UTC (permalink / raw)
  To: Keith Busch
  Cc: Linux PCI, Bjorn Helgaas, Alex_Gagniuc, Scott Bauer, Oza Pawandeep

On Mon, Apr 09, 2018 at 04:04:40PM -0600, Keith Busch wrote:
> AER error handling walks the PCI topology below a root port, saving
> pointers of the pci_dev structs affected by the error along the way.
> 
> At the same time, the pcie hotplug driver could be freeing those very
> same structures, causing the AER driver to reference freed memory.
> 
> This series fixes this by synchroniziing the aer driver with the pci
> hotplug driver during. The final patch is the one that ultimately provides
> the locking by having AER lock the same pci lock rescan/remove mutex as
> the pciehp driver. The first three patches are prepping to make it safe
> for the aer bottom half handler to hold that lock.
> 
> Keith Busch (4):
>   PCI/AER: Remove unused parameters
>   PCI/AER: Replace struct pcie_device with pci_dev
>   PCI/AER: Reference count aer structures
>   PCI/AER: Lock pci topology when scanning errors
> 
>  drivers/pci/pcie/aer/aerdrv.c      | 28 +++++++++++++++++++++-------
>  drivers/pci/pcie/aer/aerdrv.h      |  9 +++------
>  drivers/pci/pcie/aer/aerdrv_core.c | 38 +++++++++++++++++---------------------
>  3 files changed, 41 insertions(+), 34 deletions(-)

I applied the first two to pci/aer for v4.18, thanks!

I think the last two might need some rework to adapt after Oza's patches. 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors
  2018-06-05 22:09   ` Bjorn Helgaas
@ 2018-06-05 22:18     ` Keith Busch
  2018-06-06 13:52       ` Bjorn Helgaas
  0 siblings, 1 reply; 15+ messages in thread
From: Keith Busch @ 2018-06-05 22:18 UTC (permalink / raw)
  To: Bjorn Helgaas; +Cc: Linux PCI, Bjorn Helgaas, Alex_Gagniuc, Scott Bauer

On Tue, Jun 05, 2018 at 05:09:11PM -0500, Bjorn Helgaas wrote:
> > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work)
> >  	struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler);
> >  	struct aer_err_source uninitialized_var(e_src);
> >  
> > -	mutex_lock(&rpc->rpc_mutex);
> > +	pci_lock_rescan_remove();
> >  	while (get_e_source(rpc, &e_src))
> >  		aer_isr_one_error(rpc, &e_src);
> > -	mutex_unlock(&rpc->rpc_mutex);
> > +	pci_unlock_rescan_remove();
> 
> I think this needs to be updated after Oza's patches, doesn't it?
> 
> It looks like this would deadlock if I applied it to my current "next"
> branch as-is:
> 
>   aer_isr
>     pci_lock_rescan_remove
>     aer_isr_one_error
>       aer_process_err_devices
>         handle_error_source
>           pcie_do_fatal_recovery
>             pci_lock_rescan_remove      <-- deadlock
> 
> >       aer_release(rpc);
> >  }

Yes, looks like you are right about that.

I fully intended to have this rebased on that by now, but nvme issues
took way more time than I anticipated. Things appear to have calmed down
on that front, and I should be able to rebase appropriately this week
(famous last words...).

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors
  2018-06-05 22:18     ` Keith Busch
@ 2018-06-06 13:52       ` Bjorn Helgaas
  0 siblings, 0 replies; 15+ messages in thread
From: Bjorn Helgaas @ 2018-06-06 13:52 UTC (permalink / raw)
  To: Keith Busch; +Cc: Linux PCI, Bjorn Helgaas, Alex_Gagniuc, Scott Bauer

On Tue, Jun 05, 2018 at 04:18:26PM -0600, Keith Busch wrote:
> On Tue, Jun 05, 2018 at 05:09:11PM -0500, Bjorn Helgaas wrote:
> > > @@ -796,10 +796,10 @@ void aer_isr(struct work_struct *work)
> > >  	struct aer_rpc *rpc = container_of(work, struct aer_rpc, dpc_handler);
> > >  	struct aer_err_source uninitialized_var(e_src);
> > >  
> > > -	mutex_lock(&rpc->rpc_mutex);
> > > +	pci_lock_rescan_remove();
> > >  	while (get_e_source(rpc, &e_src))
> > >  		aer_isr_one_error(rpc, &e_src);
> > > -	mutex_unlock(&rpc->rpc_mutex);
> > > +	pci_unlock_rescan_remove();
> > 
> > I think this needs to be updated after Oza's patches, doesn't it?
> > 
> > It looks like this would deadlock if I applied it to my current "next"
> > branch as-is:
> > 
> >   aer_isr
> >     pci_lock_rescan_remove
> >     aer_isr_one_error
> >       aer_process_err_devices
> >         handle_error_source
> >           pcie_do_fatal_recovery
> >             pci_lock_rescan_remove      <-- deadlock
> > 
> > >       aer_release(rpc);
> > >  }
> 
> Yes, looks like you are right about that.
> 
> I fully intended to have this rebased on that by now, but nvme issues
> took way more time than I anticipated. Things appear to have calmed down
> on that front, and I should be able to rebase appropriately this week
> (famous last words...).

No worries, it's doubtful that I can still squeeze it into v4.18, so I
think it's better if we target this for v4.19.

I have some questions unrelated to the rebase:

  - What does the use-after-free look like to a user?  Panic,
    corruption, etc?  It's nice if we have breadcrumbs that connect a
    symptom to the fix.

  - I'm not super comfortable with the AER tree traversal in
    find_source_device().  Obviously this has always been there and is
    not your issue.  But it's dubious when a driver for device X also
    peeks at devices Y, Z, etc.  That always leads to locking issues.

  - I'm not clear on how pci_bus_sem and pci_rescan_remove_lock should
    be used.  pci_bus_sem is acquired by pci_walk_bus() and a few
    other PCI core paths.  pci_rescan_remove_lock is acquired (via
    pci_lock_rescan_remove()) by hotplug drivers and a few other
    add/remove/rescan paths.

    Most users of pci_walk_bus() do not use pci_lock_rescan_remove(),
    so it's not clear to me how to decide whether they *all* should,
    or why this AER path is different from the ones that don't need
    to.

Bjorn

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2018-06-06 13:52 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-09 22:04 [PATCH 0/4] PCI/AER: Use-after-free fix Keith Busch
2018-04-09 22:04 ` [PATCH 1/4] PCI/AER: Remove unused parameters Keith Busch
2018-04-09 22:04 ` [PATCH 2/4] PCI/AER: Replace struct pcie_device with pci_dev Keith Busch
2018-04-09 22:04 ` [PATCH 3/4] PCI/AER: Reference count aer structures Keith Busch
2018-04-09 22:04 ` [PATCH 4/4] PCI/AER: Lock pci topology when scanning errors Keith Busch
2018-06-05 22:09   ` Bjorn Helgaas
2018-06-05 22:18     ` Keith Busch
2018-06-06 13:52       ` Bjorn Helgaas
2018-04-10 13:15 ` [PATCH 0/4] PCI/AER: Use-after-free fix Dongdong Liu
2018-04-12 17:06 ` Alex_Gagniuc
2018-04-12 16:47   ` Scott Bauer
2018-04-13 14:49     ` Alex_Gagniuc
2018-04-16 19:49     ` Alex_Gagniuc
2018-04-12 17:10   ` Keith Busch
2018-06-05 22:11 ` Bjorn Helgaas

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.