linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref
@ 2013-11-26  1:28 Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus Yinghai Lu
                   ` (9 more replies)
  0 siblings, 10 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

The first 4 patches are for Gu Zheng <guz.fnst@cn.fujitsu.com>, to fix
double removal of PCI devices via sysfs.

The remaining 6 patches are about 64-bit MMIO allocation, which could help
Guo Chao <yan@linux.vnet.ibm.com> with MMIO allocation on powerpc.
They try to assign 64-bit resources above 4G first.

The series can be found at:
        git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-pci-3.14

It is based on the current pci/for-linus.

-v2: update after the patch that moves device_del down to pci_destroy_dev.
     add "Try best to allocate pref mmio 64bit above 4G"

Yinghai Lu (10):
  PCI: Use device_release_driver in pci_stop_root_bus
  PCI: Move back pci_proc_attach_devices calling
  PCI: Move resources and bus_list releasing to pci_release_dev
  PCI: Destroy pci dev only once
  PCI: pcibus address to resource converting take bus directly
  PCI: Add pcibios_bus_addr_to_res()
  PCI: Try to allocate mem64 above 4G at first
  PCI: Try best to allocate pref mmio 64bit above 4g
  PCI: Sort pci root bus resources list
  intel-gtt: Read 64bit for gmar_bus_addr

 arch/x86/include/asm/pci.h   |   1 -
 drivers/char/agp/intel-gtt.c |  14 +++--
 drivers/pci/bus.c            |  58 +++++++++++++++----
 drivers/pci/host-bridge.c    |  48 +++++++++++-----
 drivers/pci/pci.h            |   2 +
 drivers/pci/probe.c          |  23 ++++++--
 drivers/pci/remove.c         |  31 +++-------
 drivers/pci/setup-bus.c      | 133 ++++++++++++++++++++++++++++---------------
 drivers/pci/setup-res.c      |  14 ++++-
 include/linux/pci.h          |  10 ++--
 10 files changed, 228 insertions(+), 106 deletions(-)

-- 
1.8.1.4



* [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-27  1:09   ` Rafael J. Wysocki
  2013-11-26  1:28 ` [PATCH v2 02/10] PCI: Move back pci_proc_attach_devices calling Yinghai Lu
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

To be consistent with the change in
| PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()

use device_release_driver() for the root bus/host bridge.

Also use device_unregister() in pci_remove_root_bus() instead of
device_del()/put_device(), so that it pairs with the device_register()
done for the host bridge in pci_create_root_bus().

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/remove.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index cc9337a..692f4c3 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -128,7 +128,7 @@ void pci_stop_root_bus(struct pci_bus *bus)
 		pci_stop_bus_device(child);
 
 	/* stop the host bridge */
-	device_del(&host_bridge->dev);
+	device_release_driver(&host_bridge->dev);
 }
 
 void pci_remove_root_bus(struct pci_bus *bus)
@@ -147,5 +147,5 @@ void pci_remove_root_bus(struct pci_bus *bus)
 	host_bridge->bus = NULL;
 
 	/* remove the host bridge */
-	put_device(&host_bridge->dev);
+	device_unregister(&host_bridge->dev);
 }
-- 
1.8.1.4



* [PATCH v2 02/10] PCI: Move back pci_proc_attach_devices calling
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev Yinghai Lu
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

The proc entry is detached when the device is stopped, so it should be
attached in pci_bus_add_device().

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/bus.c   | 1 +
 drivers/pci/probe.c | 2 --
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index fc1b740..1ffd95b 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -175,6 +175,7 @@ int pci_bus_add_device(struct pci_dev *dev)
 	 * are not assigned yet for some devices.
 	 */
 	pci_fixup_device(pci_fixup_final, dev);
+	pci_proc_attach_device(dev);
 	pci_create_sysfs_dev_files(dev);
 
 	dev->match_driver = true;
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 38e403d..173a9cf 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1381,8 +1381,6 @@ void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
 	dev->match_driver = false;
 	ret = device_add(&dev->dev);
 	WARN_ON(ret < 0);
-
-	pci_proc_attach_device(dev);
 }
 
 struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
-- 
1.8.1.4



* [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 02/10] PCI: Move back pci_proc_attach_devices calling Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-27  1:15   ` Rafael J. Wysocki
  2013-11-26  1:28 ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
                   ` (6 subsequent siblings)
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

We should not release resources in pci_destroy_dev(); that is too early,
as other users could still hold a reference to the device.

Release them, and remove the device from the bus devices list, at the very
end in pci_release_dev() instead.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/probe.c  | 21 +++++++++++++++++++--
 drivers/pci/remove.c | 19 -------------------
 2 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 173a9cf..12ec56c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1154,6 +1154,18 @@ static void pci_release_capabilities(struct pci_dev *dev)
 	pci_free_cap_save_buffers(dev);
 }
 
+static void pci_free_resources(struct pci_dev *dev)
+{
+	int i;
+
+	pci_cleanup_rom(dev);
+	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
+		struct resource *res = dev->resource + i;
+		if (res->parent)
+			release_resource(res);
+	}
+}
+
 /**
  * pci_release_dev - free a pci device structure when all users of it are finished.
  * @dev: device that's been disconnected
@@ -1163,9 +1175,14 @@ static void pci_release_capabilities(struct pci_dev *dev)
  */
 static void pci_release_dev(struct device *dev)
 {
-	struct pci_dev *pci_dev;
+	struct pci_dev *pci_dev = to_pci_dev(dev);
+
+	down_write(&pci_bus_sem);
+	list_del(&pci_dev->bus_list);
+	up_write(&pci_bus_sem);
+
+	pci_free_resources(pci_dev);
 
-	pci_dev = to_pci_dev(dev);
 	pci_release_capabilities(pci_dev);
 	pci_release_of_node(pci_dev);
 	pcibios_release_device(pci_dev);
diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 692f4c3..f452148 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -3,20 +3,6 @@
 #include <linux/pci-aspm.h>
 #include "pci.h"
 
-static void pci_free_resources(struct pci_dev *dev)
-{
-	int i;
-
-	msi_remove_pci_irq_vectors(dev);
-
-	pci_cleanup_rom(dev);
-	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
-		struct resource *res = dev->resource + i;
-		if (res->parent)
-			release_resource(res);
-	}
-}
-
 static void pci_stop_dev(struct pci_dev *dev)
 {
 	pci_pme_active(dev, false);
@@ -36,11 +22,6 @@ static void pci_destroy_dev(struct pci_dev *dev)
 {
 	device_del(&dev->dev);
 
-	down_write(&pci_bus_sem);
-	list_del(&dev->bus_list);
-	up_write(&pci_bus_sem);
-
-	pci_free_resources(dev);
 	put_device(&dev->dev);
 }
 
-- 
1.8.1.4
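
To illustrate the ordering this patch aims for, here is a minimal user-space
sketch, not kernel code. It assumes a plain integer refcount in place of the
struct device kref; struct toy_dev, toy_put() and toy_release() are
hypothetical names. The point is only that the unlink and resource release
happen when the last reference is dropped, not when removal starts.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct toy_dev {
	int refcount;		/* stands in for the embedded device refcount */
	bool on_bus_list;	/* still linked on the bus devices list */
	bool resources_held;	/* BAR resources still claimed */
};

/* Runs only once the last reference is gone (compare pci_release_dev). */
static void toy_release(struct toy_dev *dev)
{
	dev->on_bus_list = false;
	dev->resources_held = false;
}

/* Drop one reference; the final put triggers the release step. */
static void toy_put(struct toy_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0)
		toy_release(dev);
}
```

With two references held, the first toy_put() leaves the device linked and
its resources claimed; only the second one releases them.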



* [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (2 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  3:38   ` Bjorn Helgaas
  2013-11-26  1:28 ` [PATCH v2 05/10] PCI: pcibus address to resource converting take bus directly Yinghai Lu
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

Multiple removals via /sys will call pci_destroy_dev twice.

| When concurrently removing pci devices which are in the same pci subtree
| via sysfs, such as:
| echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
| /sys/bus/pci/devices/0000\:1a\:01.0/remove
| (1a:01.0 device is downstream from the 10:00.0 bridge)
|
| the following warning will show:
| [ 1799.280918] ------------[ cut here ]------------
| [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
| [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
| [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
| [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
| [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
| [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
| [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
| [ 1800.842282] Call Trace:
| [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
| [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
| [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
| [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
| [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
| [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
| [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
| [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
| [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
| [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
| [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
| [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
| [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
| [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
| [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
| [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
| [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
| [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
|
| This issue is related to the bug 54411:
| https://bugzilla.kernel.org/show_bug.cgi?id=54411

Add is_removed to record whether pci_destroy_dev has already been called.

During the second call, an extra device reference is still held via
device_schedule_callback, so it is safe to check dev->is_removed.

It fixes the problem in Gu's test.

-v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
     refresh after Rafael's patch that moves device_del.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/remove.c | 8 +++++---
 include/linux/pci.h  | 1 +
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index f452148..b090cec 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
-	device_del(&dev->dev);
-
-	put_device(&dev->dev);
+	if (!dev->is_removed) {
+		device_del(&dev->dev);
+		dev->is_removed = 1;
+		put_device(&dev->dev);
+	}
 }
 
 void pci_remove_bus(struct pci_bus *bus)
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 1084a15..ccb316d 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_removed:1;	/* pci_destroy_dev is called */
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	block_cfg_access:1;	/* config space access is blocked */
-- 
1.8.1.4
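
The guard in this patch can be pictured with a small user-space sketch, not
kernel code. struct demo_dev, demo_put() and demo_destroy() are hypothetical
names, and destroy_runs exists only so the effect can be observed. The sketch
is single-threaded and says nothing about locking between concurrent callers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct demo_dev {
	int refcount;		/* stands in for the device refcount */
	bool is_removed;	/* set once the destroy path has run */
	int destroy_runs;	/* counts real teardowns, for illustration */
};

static void demo_put(struct demo_dev *dev)
{
	assert(dev->refcount > 0);
	dev->refcount--;
}

/* Tear the device down at most once across repeated sequential calls. */
static void demo_destroy(struct demo_dev *dev)
{
	if (dev->is_removed)
		return;
	dev->destroy_runs++;	/* device_del() would happen here */
	dev->is_removed = true;
	demo_put(dev);		/* drop the creation reference exactly once */
}
```

Calling demo_destroy() twice on a device that starts with two references runs
the teardown once and leaves one reference for the second caller to drop.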



* [PATCH v2 05/10] PCI: pcibus address to resource converting take bus directly
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (3 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 06/10] PCI: Add pcibios_bus_addr_to_res() Yinghai Lu
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

In the bus resource allocation path we do not have a dev to pass along, so
we can just use the bus instead.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/host-bridge.c | 34 +++++++++++++++++++++-------------
 include/linux/pci.h       |  3 +++
 2 files changed, 24 insertions(+), 13 deletions(-)

diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index a68dc61..2e7288b 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -9,22 +9,19 @@
 
 #include "pci.h"
 
-static struct pci_bus *find_pci_root_bus(struct pci_dev *dev)
+static struct pci_bus *find_pci_root_bus(struct pci_bus *bus)
 {
-	struct pci_bus *bus;
-
-	bus = dev->bus;
 	while (bus->parent)
 		bus = bus->parent;
 
 	return bus;
 }
 
-static struct pci_host_bridge *find_pci_host_bridge(struct pci_dev *dev)
+static struct pci_host_bridge *find_pci_host_bridge(struct pci_bus *bus)
 {
-	struct pci_bus *bus = find_pci_root_bus(dev);
+	struct pci_bus *root_bus = find_pci_root_bus(bus);
 
-	return to_pci_host_bridge(bus->bridge);
+	return to_pci_host_bridge(root_bus->bridge);
 }
 
 void pci_set_host_bridge_release(struct pci_host_bridge *bridge,
@@ -40,10 +37,11 @@ static bool resource_contains(struct resource *res1, struct resource *res2)
 	return res1->start <= res2->start && res1->end >= res2->end;
 }
 
-void pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
-			     struct resource *res)
+void __pcibios_resource_to_bus(struct pci_bus *bus,
+				      struct pci_bus_region *region,
+				      struct resource *res)
 {
-	struct pci_host_bridge *bridge = find_pci_host_bridge(dev);
+	struct pci_host_bridge *bridge = find_pci_host_bridge(bus);
 	struct pci_host_bridge_window *window;
 	resource_size_t offset = 0;
 
@@ -60,6 +58,11 @@ void pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
 	region->start = res->start - offset;
 	region->end = res->end - offset;
 }
+void pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
+			     struct resource *res)
+{
+	__pcibios_resource_to_bus(dev->bus, region, res);
+}
 EXPORT_SYMBOL(pcibios_resource_to_bus);
 
 static bool region_contains(struct pci_bus_region *region1,
@@ -68,10 +71,10 @@ static bool region_contains(struct pci_bus_region *region1,
 	return region1->start <= region2->start && region1->end >= region2->end;
 }
 
-void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
-			     struct pci_bus_region *region)
+static void __pcibios_bus_to_resource(struct pci_bus *bus, struct resource *res,
+				      struct pci_bus_region *region)
 {
-	struct pci_host_bridge *bridge = find_pci_host_bridge(dev);
+	struct pci_host_bridge *bridge = find_pci_host_bridge(bus);
 	struct pci_host_bridge_window *window;
 	resource_size_t offset = 0;
 
@@ -93,4 +96,9 @@ void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
 	res->start = region->start + offset;
 	res->end = region->end + offset;
 }
+void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
+			     struct pci_bus_region *region)
+{
+	__pcibios_bus_to_resource(dev->bus, res, region);
+}
 EXPORT_SYMBOL(pcibios_bus_to_resource);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index ccb316d..55ee90f 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -738,6 +738,9 @@ void pci_fixup_cardbus(struct pci_bus *);
 
 /* Generic PCI functions used internally */
 
+void __pcibios_resource_to_bus(struct pci_bus *bus,
+			       struct pci_bus_region *region,
+			       struct resource *res);
 void pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
 			     struct resource *res);
 void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
-- 
1.8.1.4



* [PATCH v2 06/10] PCI: Add pcibios_bus_addr_to_res()
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (4 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 05/10] PCI: pcibus address to resource converting take bus directly Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first Yinghai Lu
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

It takes a bus address and returns only the converted resource address.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/host-bridge.c | 14 ++++++++++++++
 include/linux/pci.h       |  2 ++
 2 files changed, 16 insertions(+)

diff --git a/drivers/pci/host-bridge.c b/drivers/pci/host-bridge.c
index 2e7288b..c911adb 100644
--- a/drivers/pci/host-bridge.c
+++ b/drivers/pci/host-bridge.c
@@ -102,3 +102,17 @@ void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
 	__pcibios_bus_to_resource(dev->bus, res, region);
 }
 EXPORT_SYMBOL(pcibios_bus_to_resource);
+
+resource_size_t pcibios_bus_addr_to_res(struct pci_bus *bus, int flags,
+					resource_size_t addr)
+{
+	struct pci_bus_region region;
+	struct resource r;
+
+	r.flags = flags;
+	region.start = addr;
+	region.end = addr;
+	__pcibios_bus_to_resource(bus, &r, &region);
+
+	return r.end;
+}
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 55ee90f..3c6e399 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -745,6 +745,8 @@ void pcibios_resource_to_bus(struct pci_dev *dev, struct pci_bus_region *region,
 			     struct resource *res);
 void pcibios_bus_to_resource(struct pci_dev *dev, struct resource *res,
 			     struct pci_bus_region *region);
+resource_size_t pcibios_bus_addr_to_res(struct pci_bus *bus, int flags,
+					resource_size_t addr);
 void pcibios_scan_specific_bus(int busn);
 struct pci_bus *pci_find_bus(int domain, int busnr);
 void pci_bus_add_devices(const struct pci_bus *bus);
-- 
1.8.1.4
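
As a rough picture of what the conversion does, here is a user-space sketch,
not kernel code. It assumes each host bridge window is described by a CPU
range plus an offset, and it ignores the resource type matching that the real
helper does through the flags argument. struct toy_window and
toy_bus_addr_to_cpu() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host bridge window: CPU range plus CPU-to-bus offset. */
struct toy_window {
	uint64_t cpu_start;
	uint64_t cpu_end;
	uint64_t offset;	/* cpu address = bus address + offset */
};

/*
 * Convert one bus address to a CPU address. If no window covers the
 * address, the offset stays 0 and the address is returned unchanged.
 */
static uint64_t toy_bus_addr_to_cpu(const struct toy_window *w, size_t n,
				    uint64_t bus_addr)
{
	for (size_t i = 0; i < n; i++) {
		assert(w[i].cpu_start >= w[i].offset);
		uint64_t bus_start = w[i].cpu_start - w[i].offset;
		uint64_t bus_end = w[i].cpu_end - w[i].offset;

		if (bus_addr >= bus_start && bus_addr <= bus_end)
			return bus_addr + w[i].offset;
	}
	return bus_addr;
}
```

For a window covering CPU addresses 0x100000000 to 0x1ffffffff with offset
0x80000000, bus address 0xffffffff maps to CPU address 0x17fffffff.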



* [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (5 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 06/10] PCI: Add pcibios_bus_addr_to_res() Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  4:15   ` Bjorn Helgaas
  2013-11-26  1:28 ` [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g Yinghai Lu
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

Fall back to below 4G if no space can be found above 4G.

32-bit x86 without X86_PAE support will have bottom set to 0, because
resource_size_t is 32-bit.

A 32-bit kernel with 64-bit resource_size_t on a machine with PAE support
is also safe, because iomem_resource is limited to 32 bits according to
x86_phys_bits.

-v2: update bottom assignment to make it clear for machines without PAE support.
-v3: Bjorn's changes:
        use MAX_RESOURCE instead of -1
        use start/end instead of bottom/max
        for all arches instead of just x86_64
-v4: update after the PCI_MAX_RESOURCE_32 change.
-v5: restore io handling to use PCI_MAX_RESOURCE_32 as limit.
-v6: check the pcibios_resource_to_bus result for every bus resource to
     decide whether we need to try high first.
     It supports all arches instead of just x86_64.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 arch/x86/include/asm/pci.h |  1 -
 drivers/pci/bus.c          | 42 ++++++++++++++++++++++++++++++++++--------
 drivers/pci/pci.h          |  2 ++
 include/linux/pci.h        |  4 ----
 4 files changed, 36 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index 947b5c4..122c299 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -125,7 +125,6 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
 
 /* generic pci stuff */
 #include <asm-generic/pci.h>
-#define PCIBIOS_MAX_MEM_32 0xffffffff
 
 #ifdef CONFIG_NUMA
 /* Returns the node based on pci bus */
diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index 1ffd95b..f801f6a 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -125,15 +125,13 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
 {
 	int i, ret = -ENOMEM;
 	struct resource *r;
-	resource_size_t max = -1;
 
 	type_mask |= IORESOURCE_IO | IORESOURCE_MEM;
 
-	/* don't allocate too high if the pref mem doesn't support 64bit*/
-	if (!(res->flags & IORESOURCE_MEM_64))
-		max = PCIBIOS_MAX_MEM_32;
-
 	pci_bus_for_each_resource(bus, r, i) {
+		resource_size_t start, end, middle;
+		struct pci_bus_region region;
+
 		if (!r)
 			continue;
 
@@ -147,14 +145,42 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
 		    !(res->flags & IORESOURCE_PREFETCH))
 			continue;
 
+		start = 0;
+		end = MAX_RESOURCE;
+		/*
+		 * don't allocate too high if the pref mem doesn't
+		 * support 64bit, also if this is a 64-bit mem
+		 * resource, try above 4GB first
+		 */
+		__pcibios_resource_to_bus(bus, &region, r);
+		if (region.start <= PCI_MAX_ADDR_32 &&
+		    region.end > PCI_MAX_ADDR_32) {
+			middle = pcibios_bus_addr_to_res(bus, res->flags,
+						      PCI_MAX_ADDR_32);
+			if (res->flags & IORESOURCE_MEM_64)
+				start = middle + 1;
+			else
+				end = middle;
+		} else if (region.start > PCI_MAX_ADDR_32 &&
+			   !(res->flags & IORESOURCE_MEM_64))
+				continue;
+
+again:
 		/* Ok, try it out.. */
 		ret = allocate_resource(r, res, size,
-					r->start ? : min,
-					max, align,
+					max(start, r->start ? : min),
+					end, align,
 					alignf, alignf_data);
 		if (ret == 0)
-			break;
+			return 0;
+
+		if (start != 0) {
+			start = 0;
+			goto again;
+		}
 	}
+
+
 	return ret;
 }
 
diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index 9c91ecc..aea4efb 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -198,6 +198,8 @@ enum pci_bar_type {
 	pci_bar_mem64,		/* A 64-bit memory BAR */
 };
 
+#define PCI_MAX_ADDR_32	((resource_size_t)0xffffffff)
+
 bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
 				int crs_timeout);
 int pci_setup_device(struct pci_dev *dev);
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 3c6e399..1c69789 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -1491,10 +1491,6 @@ static inline struct pci_dev *pci_dev_get(struct pci_dev *dev)
 
 #include <asm/pci.h>
 
-#ifndef PCIBIOS_MAX_MEM_32
-#define PCIBIOS_MAX_MEM_32 (-1)
-#endif
-
 /* these helpers provide future and backwards compatibility
  * for accessing popular PCI BAR info */
 #define pci_resource_start(dev, bar)	((dev)->resource[(bar)].start)
-- 
1.8.1.4
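
The "high first, then fall back" policy can be tried out with a user-space
sketch, not kernel code. It assumes a single free window, identical bus and
CPU addresses, and a trivial first-fit placement; struct toy_range,
toy_fit() and toy_alloc() are hypothetical names and do not model
allocate_resource().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ADDR_32 0xffffffffULL

/* Hypothetical free window [start, end], both ends inclusive. */
struct toy_range {
	uint64_t start;
	uint64_t end;
};

/* First fit inside [lo, hi] clipped to r; align must be a power of two. */
static bool toy_fit(const struct toy_range *r, uint64_t lo, uint64_t hi,
		    uint64_t size, uint64_t align, uint64_t *out)
{
	uint64_t start, last;

	assert(size > 0 && align > 0 && (align & (align - 1)) == 0);
	if (lo < r->start)
		lo = r->start;
	if (hi > r->end)
		hi = r->end;
	if (lo > hi)
		return false;
	start = (lo + align - 1) & ~(align - 1);
	if (start < lo)
		return false;	/* rounding up wrapped around */
	last = start + size - 1;
	if (last < start || last > hi)
		return false;
	*out = start;
	return true;
}

/* 64-bit resources try above 4G first; 32-bit ones are capped at 4G. */
static bool toy_alloc(const struct toy_range *r, bool mem64, uint64_t size,
		      uint64_t align, uint64_t *out)
{
	uint64_t lo = 0, hi = UINT64_MAX;

	if (r->start <= TOY_MAX_ADDR_32 && r->end > TOY_MAX_ADDR_32) {
		if (mem64)
			lo = TOY_MAX_ADDR_32 + 1;
		else
			hi = TOY_MAX_ADDR_32;
	} else if (r->start > TOY_MAX_ADDR_32 && !mem64) {
		return false;	/* window unusable for a 32-bit resource */
	}
	if (toy_fit(r, lo, hi, size, align, out))
		return true;
	/* Nothing fits above 4G: retry from the bottom of the window. */
	return lo != 0 && toy_fit(r, 0, hi, size, align, out);
}
```

With a window from 0xC0000000 to 0x10FFFFFFF, a 256M 64-bit request lands at
0x100000000, while a 512M 64-bit request does not fit above 4G and falls back
to 0xC0000000.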



* [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (6 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  4:17   ` Bjorn Helgaas
  2013-11-26  1:28 ` [PATCH v2 09/10] PCI: Sort pci root bus resources list Yinghai Lu
  2013-11-26  1:28 ` [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr Yinghai Lu
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

Currently, when one of the children resources does not support MEM_64,
MEM_64 for the bridge gets reset, which pulls the whole pref resource on
the bridge down under 4G.

If the bridge supports pref mem 64, only allocate from it for children
that support pref mem64.
Children resources that only support pref mem 32 are allocated from
non-pref mem instead.

If the bridge only supports 32-bit pref mmio, all children pref mmio still
goes under that.

-v2: Add support for releasing the bridge mem res for pref_mem children res.
-v3: refresh so that it can be applied before the for_each_dev_res patchset.
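
The placement rules above can be summarised in a small user-space sketch, not
kernel code. The flag bits and the names toy_pick_window(), TOY_WIN_MEM and
TOY_WIN_PREF are hypothetical; the kernel works with IORESOURCE_* flags and
the bridge windows directly. The case of a bridge without any pref window is
taken from the existing comment in __pci_bus_size_bridges().

```c
#include <assert.h>
#include <stdbool.h>

/* Which bridge window a child memory resource is sized into. */
enum toy_window { TOY_WIN_MEM, TOY_WIN_PREF };

/* Hypothetical flag bits; the kernel uses IORESOURCE_* instead. */
#define TOY_PREFETCH	0x1u
#define TOY_MEM_64	0x2u

static enum toy_window toy_pick_window(bool bridge_has_pref,
				       bool bridge_pref_64,
				       unsigned int child_flags)
{
	/* A 64-bit pref window implies that a pref window exists. */
	assert(bridge_has_pref || !bridge_pref_64);

	/* Non-pref children, or no pref window at all: non-pref window. */
	if (!bridge_has_pref || !(child_flags & TOY_PREFETCH))
		return TOY_WIN_MEM;
	/* 64-bit pref window: only pref mem64 children may use it. */
	if (bridge_pref_64 && !(child_flags & TOY_MEM_64))
		return TOY_WIN_MEM;
	/* 32-bit pref window takes all pref children. */
	return TOY_WIN_PREF;
}
```

For example, a 32-bit pref child under a bridge with a 64-bit pref window is
placed in the non-pref window, so it no longer forces the pref window below 4G.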

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
---
 drivers/pci/setup-bus.c | 133 ++++++++++++++++++++++++++++++++----------------
 drivers/pci/setup-res.c |  14 ++++-
 2 files changed, 101 insertions(+), 46 deletions(-)

diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
index 219a410..b98419e 100644
--- a/drivers/pci/setup-bus.c
+++ b/drivers/pci/setup-bus.c
@@ -711,12 +711,11 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
    bus resource of a given type. Note: we intentionally skip
    the bus resources which have already been assigned (that is,
    have non-NULL parent resource). */
-static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
+static struct resource *find_free_bus_resource(struct pci_bus *bus,
+			 unsigned long type_mask, unsigned long type)
 {
 	int i;
 	struct resource *r;
-	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
 
 	pci_bus_for_each_resource(bus, r, i) {
 		if (r == &ioport_resource || r == &iomem_resource)
@@ -813,7 +812,8 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
 		resource_size_t add_size, struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
-	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
+	struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
+							IORESOURCE_IO);
 	resource_size_t size = 0, size0 = 0, size1 = 0;
 	resource_size_t children_add_size = 0;
 	resource_size_t min_align, align;
@@ -913,15 +913,16 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
  * guarantees that all child resources fit in this size.
  */
 static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
-			 unsigned long type, resource_size_t min_size,
-			resource_size_t add_size,
-			struct list_head *realloc_head)
+			 unsigned long type, unsigned long type2,
+			 resource_size_t min_size, resource_size_t add_size,
+			 struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
 	resource_size_t min_align, align, size, size0, size1;
 	resource_size_t aligns[12];	/* Alignments from 1Mb to 2Gb */
 	int order, max_order;
-	struct resource *b_res = find_free_bus_resource(bus, type);
+	struct resource *b_res = find_free_bus_resource(bus,
+					 mask | IORESOURCE_PREFETCH, type);
 	unsigned int mem64_mask = 0;
 	resource_size_t children_add_size = 0;
 
@@ -942,7 +943,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
 			struct resource *r = &dev->resource[i];
 			resource_size_t r_size;
 
-			if (r->parent || (r->flags & mask) != type)
+			if (r->parent || ((r->flags & mask) != type &&
+					  (r->flags & mask) != type2))
 				continue;
 			r_size = resource_size(r);
 #ifdef CONFIG_PCI_IOV
@@ -1115,8 +1117,9 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
 			struct list_head *realloc_head)
 {
 	struct pci_dev *dev;
-	unsigned long mask, prefmask;
+	unsigned long mask, prefmask, type2 = 0;
 	resource_size_t additional_mem_size = 0, additional_io_size = 0;
+	struct resource *b_res;
 
 	list_for_each_entry(dev, &bus->devices, bus_list) {
 		struct pci_bus *b = dev->subordinate;
@@ -1161,15 +1164,31 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
 		   has already been allocated by arch code, try
 		   non-prefetchable range for both types of PCI memory
 		   resources. */
+		b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
 		mask = IORESOURCE_MEM;
 		prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
-		if (pbus_size_mem(bus, prefmask, prefmask,
+		if (b_res[2].flags & IORESOURCE_MEM_64) {
+			prefmask |= IORESOURCE_MEM_64;
+			if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
 				  realloc_head ? 0 : additional_mem_size,
-				  additional_mem_size, realloc_head))
-			mask = prefmask; /* Success, size non-prefetch only. */
-		else
-			additional_mem_size += additional_mem_size;
-		pbus_size_mem(bus, mask, IORESOURCE_MEM,
+				  additional_mem_size, realloc_head)) {
+					/* Success, size non-pref64 only. */
+					mask = prefmask;
+					type2 = prefmask & ~IORESOURCE_MEM_64;
+			}
+		}
+		if (!type2) {
+			prefmask &= ~IORESOURCE_MEM_64;
+			if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
+					 realloc_head ? 0 : additional_mem_size,
+					 additional_mem_size, realloc_head)) {
+				/* Success, size non-prefetch only. */
+				mask = prefmask;
+			} else
+				additional_mem_size += additional_mem_size;
+			type2 = IORESOURCE_MEM;
+		}
+		pbus_size_mem(bus, mask, IORESOURCE_MEM, type2,
 				realloc_head ? 0 : additional_mem_size,
 				additional_mem_size, realloc_head);
 		break;
@@ -1255,42 +1274,66 @@ static void __ref __pci_bridge_assign_resources(const struct pci_dev *bridge,
 static void pci_bridge_release_resources(struct pci_bus *bus,
 					  unsigned long type)
 {
-	int idx;
-	bool changed = false;
-	struct pci_dev *dev;
+	struct pci_dev *dev = bus->self;
 	struct resource *r;
 	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
+				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
+	unsigned old_flags = 0;
+	struct resource *b_res;
+	int idx = 1;
 
-	dev = bus->self;
-	for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
-	     idx++) {
-		r = &dev->resource[idx];
-		if ((r->flags & type_mask) != type)
-			continue;
-		if (!r->parent)
-			continue;
-		/*
-		 * if there are children under that, we should release them
-		 *  all
-		 */
-		release_child_resources(r);
-		if (!release_resource(r)) {
-			dev_printk(KERN_DEBUG, &dev->dev,
-				 "resource %d %pR released\n", idx, r);
-			/* keep the old size */
-			r->end = resource_size(r) - 1;
-			r->start = 0;
-			r->flags = 0;
-			changed = true;
-		}
-	}
+	b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
+
+	/*
+	 *     1. if there is io port assign fail, will release bridge
+	 *	  io port.
+	 *     2. if there is non pref mmio assign fail, release bridge
+	 *	  nonpref mmio.
+	 *     3. if there is 64bit pref mmio assign fail, and bridge pref
+	 *	  is 64bit, release bridge pref mmio.
+	 *     4. if there is pref mmio assign fail, and bridge pref is
+	 *	  32bit mmio, release bridge pref mmio
+	 *     5. if there is pref mmio assign fail, and bridge pref is not
+	 *	  assigned, release bridge nonpref mmio.
+	 */
+	if (type & IORESOURCE_IO)
+		idx = 0;
+	else if (!(type & IORESOURCE_PREFETCH))
+		idx = 1;
+	else if ((type & IORESOURCE_MEM_64) &&
+		 (b_res[2].flags & IORESOURCE_MEM_64))
+		idx = 2;
+	else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
+		 (b_res[2].flags & IORESOURCE_PREFETCH))
+		idx = 2;
+	else
+		idx = 1;
+
+	r = &b_res[idx];
+
+	if (!r->parent)
+		return;
+
+	/*
+	 * if there are children under that, we should release them
+	 *  all
+	 */
+	release_child_resources(r);
+	if (!release_resource(r)) {
+		type = old_flags = r->flags & type_mask;
+		dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
+					PCI_BRIDGE_RESOURCES + idx, r);
+		/* keep the old size */
+		r->end = resource_size(r) - 1;
+		r->start = 0;
+		r->flags = 0;
 
-	if (changed) {
 		/* avoiding touch the one without PREF */
 		if (type & IORESOURCE_PREFETCH)
 			type = IORESOURCE_PREFETCH;
 		__pci_setup_bridge(bus, type);
+		/* for next child res under same bridge */
+		r->flags = old_flags;
 	}
 }
 
@@ -1469,7 +1512,7 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
 	LIST_HEAD(fail_head);
 	struct pci_dev_resource *fail_res;
 	unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
-				  IORESOURCE_PREFETCH;
+				  IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
 	int pci_try_num = 1;
 	enum enable_type enable_local;
 
diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
index 83c4d3b..e968412 100644
--- a/drivers/pci/setup-res.c
+++ b/drivers/pci/setup-res.c
@@ -208,9 +208,21 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
 
 	/* First, try exact prefetching match.. */
 	ret = pci_bus_alloc_resource(bus, res, size, align, min,
-				     IORESOURCE_PREFETCH,
+				     IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
 				     pcibios_align_resource, dev);
 
+	if (ret < 0 &&
+	    (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
+		/*
+		 * That failed.
+		 *
+		 * Try below 4g pref
+		 */
+		ret = pci_bus_alloc_resource(bus, res, size, align, min,
+					     IORESOURCE_PREFETCH,
+					     pcibios_align_resource, dev);
+	}
+
 	if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
 		/*
 		 * That failed.
-- 
1.8.1.4



* [PATCH v2 09/10] PCI: Sort pci root bus resources list
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (7 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  4:18   ` Bjorn Helgaas
  2013-11-26  1:28 ` [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr Yinghai Lu
  9 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu

Some x86 systems expose 64bit mmio above 4G in _CRS as a non-pref mmio range:
[   49.415281] PCI host bridge to bus 0000:00
[   49.419921] pci_bus 0000:00: root bus resource [bus 00-1e]
[   49.426107] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
[   49.433041] pci_bus 0000:00: root bus resource [io  0x1000-0x5fff]
[   49.440010] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
[   49.447768] pci_bus 0000:00: root bus resource [mem 0xfed8c000-0xfedfffff]
[   49.455532] pci_bus 0000:00: root bus resource [mem 0x90000000-0x9fffbfff]
[   49.463259] pci_bus 0000:00: root bus resource [mem 0x380000000000-0x381fffffffff]

When assigning an unassigned 64bit mmio resource, pci_bus_alloc_resource()
goes through every non-pref mmio range of the root bus.
Because that loop uses pci_bus_for_each_resource(), which visits the ranges
in list order, a request small enough to fit in a range under 4G may be
placed there instead of above 4G, even though the resource could handle
a 64bit address above 4G.

For the root bus, we can order the list from high to low in
pci_add_resource_offset(). When the root bus is created, the final bus
resource list keeps the same order:
	pci_acpi_scan_root
		==> add_resources
			==> pci_add_resource_offset: # Add to temp resources
		==> pci_create_root_bus
			==> pci_bus_add_resource # add to final bus resources.

After that, 64bit pref mmio for pci bridges is allocated from the highest
non-pref mmio range, which in this case is above 4G instead of under 4G.
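
The ordering rule can be tried outside the kernel. The following is a minimal
userspace sketch, not the patch itself: struct window and add_window_sorted()
are hypothetical stand-ins for struct pci_host_bridge_window and the list
handling in pci_add_resource_offset(), and a plain singly linked list replaces
list_head. A new window goes in front of the first entry whose end is smaller;
if there is none, it goes to the tail.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct pci_host_bridge_window. */
struct window {
	uint64_t start;
	uint64_t end;
	struct window *next;
};

/*
 * Keep the list ordered by 'end', highest first.
 * Walk past every entry whose end is not smaller than the new one,
 * then link the new window in at that position (the tail if we
 * walked off the list).
 */
static void add_window_sorted(struct window **head, struct window *w)
{
	struct window **pos = head;

	while (*pos && (*pos)->end >= w->end)
		pos = &(*pos)->next;

	w->next = *pos;
	*pos = w;
}
```

Feeding it the four mem windows from the log above, in the order they are
printed, leaves [mem 0x380000000000-0x381fffffffff] at the head of the list,
followed by the 0xfed8c000, 0x90000000 and 0x000a0000 windows.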

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/bus.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
index f801f6a..adf17858 100644
--- a/drivers/pci/bus.c
+++ b/drivers/pci/bus.c
@@ -21,7 +21,8 @@
 void pci_add_resource_offset(struct list_head *resources, struct resource *res,
 			     resource_size_t offset)
 {
-	struct pci_host_bridge_window *window;
+	struct pci_host_bridge_window *window, *tmp;
+	struct list_head *n;
 
 	window = kzalloc(sizeof(struct pci_host_bridge_window), GFP_KERNEL);
 	if (!window) {
@@ -31,7 +32,17 @@ void pci_add_resource_offset(struct list_head *resources, struct resource *res,
 
 	window->res = res;
 	window->offset = offset;
-	list_add_tail(&window->list, resources);
+
+	/* sorted it according to res end */
+	n = resources;
+	list_for_each_entry(tmp, resources, list)
+		if (window->res->end > tmp->res->end) {
+			n = &tmp->list;
+			break;
+		}
+
+	/* Insert it just before n */
+	list_add_tail(&window->list, n);
 }
 EXPORT_SYMBOL(pci_add_resource_offset);
 
-- 
1.8.1.4



* [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
                   ` (8 preceding siblings ...)
  2013-11-26  1:28 ` [PATCH v2 09/10] PCI: Sort pci root bus resources list Yinghai Lu
@ 2013-11-26  1:28 ` Yinghai Lu
  2013-11-26  3:46   ` Bjorn Helgaas
  2013-12-21  0:27   ` Bjorn Helgaas
  9 siblings, 2 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26  1:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Yinghai Lu, David Airlie

That bar could be 64bit pref mem and above 4G.
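
For illustration only, a userspace sketch of how the two config dwords of a
memory BAR combine into one bus address. The BAR_* constants are local copies
of the standard PCI memory BAR bit layout, and bar_is_64bit() and
bar_bus_addr() are hypothetical helper names, not functions from intel-gtt.c.

```c
#include <stdint.h>

/* Local copies of the PCI memory BAR bit layout (low dword). */
#define BAR_MEM_TYPE_MASK	0x06u	/* bits 2:1 encode the BAR type */
#define BAR_MEM_TYPE_64		0x04u	/* type 10b: 64-bit BAR */
#define BAR_MEM_ADDR_MASK	(~0x0fu)	/* bits 3:0 are flags */

/* True if the low dword describes a 64-bit memory BAR. */
static int bar_is_64bit(uint32_t lo)
{
	return (lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64;
}

/*
 * Build the bus address from the low dword and, for a 64-bit BAR,
 * the dword that follows it in config space. The high dword is
 * ignored for a 32-bit BAR.
 */
static uint64_t bar_bus_addr(uint32_t lo, uint32_t hi)
{
	uint64_t addr = lo & BAR_MEM_ADDR_MASK;

	if (bar_is_64bit(lo))
		addr |= (uint64_t)hi << 32;

	return addr;
}
```

With lo = 0x0000000c (64-bit, prefetchable) and hi = 0x00003800 this gives
0x380000000000; reading only the low dword would give 0.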

-v2: refresh to 3.13-rc1

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
Cc: David Airlie <airlied@linux.ie>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
---
 drivers/char/agp/intel-gtt.c | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index b8e2014..b929e9d 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -609,8 +609,10 @@ static bool intel_gtt_can_wc(void)
 static int intel_gtt_init(void)
 {
 	u32 gma_addr;
+	u32 addr_hi = 0;
 	u32 gtt_map_size;
 	int ret;
+	int pos;
 
 	ret = intel_private.driver->setup();
 	if (ret != 0)
@@ -660,13 +662,17 @@ static int intel_gtt_init(void)
 	}
 
 	if (INTEL_GTT_GEN <= 2)
-		pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
-				      &gma_addr);
+		pos = I810_GMADDR;
 	else
-		pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
-				      &gma_addr);
+		pos = I915_GMADDR;
+
+	pci_read_config_dword(intel_private.pcidev, pos, &gma_addr);
+
+	if (gma_addr & PCI_BASE_ADDRESS_MEM_TYPE_64)
+		pci_read_config_dword(intel_private.pcidev, pos + 4, &addr_hi);
 
 	intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
+	intel_private.gma_bus_addr |= (u64)addr_hi << 32;
 
 	return 0;
 }
-- 
1.8.1.4



* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26  1:28 ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
@ 2013-11-26  3:38   ` Bjorn Helgaas
  2013-11-26 19:34     ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26  3:38 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> Multiple removals via /sys can call pci_destroy_dev twice.
>
> | When concurrently removing pci devices which are in the same pci subtree
> | via sysfs, such as:
> | echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
> | /sys/bus/pci/devices/0000\:1a\:01.0/remove
> | (1a:01.0 device is downstream from the 10:00.0 bridge)
> |
> | the following warning will show:
> | [ 1799.280918] ------------[ cut here ]------------
> | [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
> | [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
> | [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
> | [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
> | [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
> | [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
> | [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
> | [ 1800.842282] Call Trace:
> | [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
> | [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
> | [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
> | [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
> | [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
> | [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
> | [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
> | [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
> | [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
> | [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
> | [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
> | [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
> | [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
> | [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
> | [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> | [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
> | [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> | [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
> |
> | This issue is related to the bug 54411:
> | https://bugzilla.kernel.org/show_bug.cgi?id=54411
>
> Add is_removed to record whether pci_destroy_dev has already been called.
>
> During the second call, an extra dev ref is still held via
> device_schedule_call, so it is safe to check dev->is_removed.
>
> It fixes the problem in Gu's test.
>
> -v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
>      refresh after patch of moving device_del from Rafael.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  drivers/pci/remove.c | 8 +++++---
>  include/linux/pci.h  | 1 +
>  2 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index f452148..b090cec 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
>
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> -       device_del(&dev->dev);
> -
> -       put_device(&dev->dev);
> +       if (!dev->is_removed) {
> +               device_del(&dev->dev);
> +               dev->is_removed = 1;

As Rafael pointed out, this looks like a race.  What prevents two
concurrent calls to pci_destroy_dev() from seeing "dev->is_removed ==
0" and both calling device_del() on the same device?
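
To make the window concrete: a plain bitfield is read and written in separate
steps, so two callers can both pass the check before either sets the flag.
One generic way to close such a window is an atomic test-and-set; holding a
lock around the whole removal is another. The userspace C11 sketch below only
illustrates the first idea. struct fake_dev and destroy_once() are
hypothetical names, and this is not code from the series.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the part of struct pci_dev that matters. */
struct fake_dev {
	atomic_flag removed;	/* clear: not destroyed yet */
	int del_calls;		/* counts how often teardown really ran */
};

/*
 * Run the teardown at most once. atomic_flag_test_and_set() returns
 * the previous value and sets the flag in one indivisible step, so
 * exactly one caller sees "was clear" no matter how calls interleave.
 */
static bool destroy_once(struct fake_dev *dev)
{
	if (atomic_flag_test_and_set(&dev->removed))
		return false;	/* somebody else already destroyed it */

	dev->del_calls++;	/* stands in for device_del()/put_device() */
	return true;
}
```

A second call on the same object returns false and leaves del_calls at 1.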

> +               put_device(&dev->dev);
> +       }
>  }
>
>  void pci_remove_bus(struct pci_bus *bus)
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 1084a15..ccb316d 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>         unsigned int    multifunction:1;/* Part of multi-function device */
>         /* keep track of device state */
>         unsigned int    is_added:1;
> +       unsigned int    is_removed:1;   /* pci_destroy_dev is called */
>         unsigned int    is_busmaster:1; /* device is busmaster */
>         unsigned int    no_msi:1;       /* device may not use msi */
>         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> --
> 1.8.1.4
>


* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-11-26  1:28 ` [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr Yinghai Lu
@ 2013-11-26  3:46   ` Bjorn Helgaas
  2013-11-26 19:35     ` Yinghai Lu
  2013-12-21  0:27   ` Bjorn Helgaas
  1 sibling, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26  3:46 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> That bar could be 64bit pref mem and above 4G.
>
> -v2: refresh to 3.13-rc1
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: David Airlie <airlied@linux.ie>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>

This looks OK to me.  Does it depend on any previous patches in this
series?  If not, I think Dave should pick it up.

Bjorn

> ---
>  drivers/char/agp/intel-gtt.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index b8e2014..b929e9d 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -609,8 +609,10 @@ static bool intel_gtt_can_wc(void)
>  static int intel_gtt_init(void)
>  {
>         u32 gma_addr;
> +       u32 addr_hi = 0;
>         u32 gtt_map_size;
>         int ret;
> +       int pos;
>
>         ret = intel_private.driver->setup();
>         if (ret != 0)
> @@ -660,13 +662,17 @@ static int intel_gtt_init(void)
>         }
>
>         if (INTEL_GTT_GEN <= 2)
> -               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
> -                                     &gma_addr);
> +               pos = I810_GMADDR;
>         else
> -               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
> -                                     &gma_addr);
> +               pos = I915_GMADDR;
> +
> +       pci_read_config_dword(intel_private.pcidev, pos, &gma_addr);
> +
> +       if (gma_addr & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +               pci_read_config_dword(intel_private.pcidev, pos + 4, &addr_hi);
>
>         intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
> +       intel_private.gma_bus_addr |= (u64)addr_hi << 32;
>
>         return 0;
>  }
> --
> 1.8.1.4
>


* Re: [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first
  2013-11-26  1:28 ` [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first Yinghai Lu
@ 2013-11-26  4:15   ` Bjorn Helgaas
  2013-11-26 20:14     ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26  4:15 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> Will fall back to below 4g if it can not find any above 4g.

Does this fix a bug?  If so, please include a bugzilla or mailing list URL.

> x86 32bit without X86_PAE support will have bottom set to 0, because
> resource_size_t is 32bit.
>
> Also, a 32bit kernel with 64bit resource_size_t on a machine with PAE support
> is safe, because iomem_resource is limited to 32bit according to
> x86_phys_bits.
>
> -v2: update bottom assigning to make it clear for non-pae support machine.
> -v3: Bjorn's change:
>         use MAX_RESOURCE instead of -1
>         use start/end instead of bottom/max
>         for all arch instead of just x86_64
> -v4: updated after PCI_MAX_RESOURCE_32 change.
> -v5: restore io handling to use PCI_MAX_RESOURCE_32 as limit.
> -v6: checking pcibios_resource_to_bus return for every bus res, to decide it
>         if we need to try high at first.
>      It supports all arches instead of just x86_64.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  arch/x86/include/asm/pci.h |  1 -
>  drivers/pci/bus.c          | 42 ++++++++++++++++++++++++++++++++++--------
>  drivers/pci/pci.h          |  2 ++
>  include/linux/pci.h        |  4 ----
>  4 files changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
> index 947b5c4..122c299 100644
> --- a/arch/x86/include/asm/pci.h
> +++ b/arch/x86/include/asm/pci.h
> @@ -125,7 +125,6 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
>
>  /* generic pci stuff */
>  #include <asm-generic/pci.h>
> -#define PCIBIOS_MAX_MEM_32 0xffffffff
>
>  #ifdef CONFIG_NUMA
>  /* Returns the node based on pci bus */
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index 1ffd95b..f801f6a 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -125,15 +125,13 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
>  {
>         int i, ret = -ENOMEM;
>         struct resource *r;
> -       resource_size_t max = -1;
>
>         type_mask |= IORESOURCE_IO | IORESOURCE_MEM;
>
> -       /* don't allocate too high if the pref mem doesn't support 64bit*/
> -       if (!(res->flags & IORESOURCE_MEM_64))
> -               max = PCIBIOS_MAX_MEM_32;
> -
>         pci_bus_for_each_resource(bus, r, i) {
> +               resource_size_t start, end, middle;
> +               struct pci_bus_region region;
> +

I think you're doing two things at once in this patch:

1) Fixing the problem that the IORESOURCE_MEM_64 constraint was being
applied to CPU addresses, not bus addresses, and

2) Trying to allocate above 4G first.

Please separate these into two patches.  The first thing is an obvious
problem and should have little risk of breaking anything.  The second
probably makes sense, but the allocation change could certainly break
something and have to be reverted.  It would be good if we could save
the first fix if that happened.

>                 if (!r)
>                         continue;
>
> @@ -147,14 +145,42 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
>                     !(res->flags & IORESOURCE_PREFETCH))
>                         continue;
>
> +               start = 0;
> +               end = MAX_RESOURCE;
> +               /*
> +                * don't allocate too high if the pref mem doesn't
> +                * support 64bit, also if this is a 64-bit mem
> +                * resource, try above 4GB first
> +                */
> +               __pcibios_resource_to_bus(bus, &region, r);
> +               if (region.start <= PCI_MAX_ADDR_32 &&
> +                   region.end > PCI_MAX_ADDR_32) {
> +                       middle = pcibios_bus_addr_to_res(bus, res->flags,
> +                                                     PCI_MAX_ADDR_32);
> +                       if (res->flags & IORESOURCE_MEM_64)
> +                               start = middle + 1;
> +                       else
> +                               end = middle;
> +               } else if (region.start > PCI_MAX_ADDR_32 &&
> +                          !(res->flags & IORESOURCE_MEM_64))
> +                               continue;

This is sort of ugly.  Can you make some sort of "pci_clip_resource()"
 so this loop remains readable?  E.g., something like:

  static struct pci_bus_region pci_mem_32 = { 0, 0xffffffffULL };
  static struct pci_bus_region pci_mem_64 = { 0x100000000ULL, 0xffffffffffffffffULL };

  struct resource avail = *r;

  if (res->flags & IORESOURCE_MEM_64)
    pci_clip_resource(&avail, &pci_mem_64);
  else
    pci_clip_resource(&avail, &pci_mem_32);
  if (!resource_size(&avail))
    continue;
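
A clip helper along those lines could look roughly like the userspace sketch
below. It is only an illustration under two assumptions: bus addresses equal
CPU addresses (no host bridge offset), and struct range and clip_range() are
hypothetical stand-ins for struct resource and the proposed
pci_clip_resource(). It returns a bool instead of relying on a size of 0,
because the size of a full 64-bit range does not fit in 64 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Inclusive address range, a stand-in for start/end of a resource. */
struct range {
	uint64_t start;
	uint64_t end;
};

/*
 * Shrink 'r' so it lies inside 'limit'.
 * Returns true if something is left, false if the two ranges do not
 * overlap (in which case 'r' is left with start > end).
 */
static bool clip_range(struct range *r, const struct range *limit)
{
	if (r->start < limit->start)
		r->start = limit->start;
	if (r->end > limit->end)
		r->end = limit->end;

	return r->start <= r->end;
}
```

Clipping a window that straddles 4G, say 0xc0000000-0x1ffffffff, against
{ 0, 0xffffffff } leaves 0xc0000000-0xffffffff, and against
{ 0x100000000, 0xffffffffffffffff } leaves 0x100000000-0x1ffffffff, so the
caller can skip a window whenever the helper reports that nothing is left.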

> +
> +again:
>                 /* Ok, try it out.. */
>                 ret = allocate_resource(r, res, size,
> -                                       r->start ? : min,
> -                                       max, align,
> +                                       max(start, r->start ? : min),
> +                                       end, align,
>                                         alignf, alignf_data);
>                 if (ret == 0)
> -                       break;
> +                       return 0;
> +
> +               if (start != 0) {
> +                       start = 0;
> +                       goto again;
> +               }
>         }
> +
> +
>         return ret;
>  }
>
> diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
> index 9c91ecc..aea4efb 100644
> --- a/drivers/pci/pci.h
> +++ b/drivers/pci/pci.h
> @@ -198,6 +198,8 @@ enum pci_bar_type {
>         pci_bar_mem64,          /* A 64-bit memory BAR */
>  };
>
> +#define PCI_MAX_ADDR_32        ((resource_size_t)0xffffffff)
> +
>  bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
>                                 int crs_timeout);
>  int pci_setup_device(struct pci_dev *dev);
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 3c6e399..1c69789 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -1491,10 +1491,6 @@ static inline struct pci_dev *pci_dev_get(struct pci_dev *dev)
>
>  #include <asm/pci.h>
>
> -#ifndef PCIBIOS_MAX_MEM_32
> -#define PCIBIOS_MAX_MEM_32 (-1)
> -#endif
> -
>  /* these helpers provide future and backwards compatibility
>   * for accessing popular PCI BAR info */
>  #define pci_resource_start(dev, bar)   ((dev)->resource[(bar)].start)
> --
> 1.8.1.4
>


* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26  1:28 ` [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g Yinghai Lu
@ 2013-11-26  4:17   ` Bjorn Helgaas
  2013-11-26  6:59     ` Guo Chao
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26  4:17 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> When one of the child resources does not support MEM_64, MEM_64 for the
> bridge gets reset, which pulls the whole pref resource on the bridge under 4G.
>
> If the bridge supports pref mem 64, only allocate from it to children that
> support pref mem64.
> Child resources that only support pref mem 32 are allocated
> from non pref mem instead.
>
> If the bridge only supports 32bit pref mmio, all children pref mmio still
> goes under that.

I can't figure out if this is supposed to fix a problem, and if so,
what problem it is.  Can you include a URL for a bugzilla or other
problem description?

> -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> -v3: refresh and make it can be applied early before for_each_dev_res patchset.
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
> ---
>  drivers/pci/setup-bus.c | 133 ++++++++++++++++++++++++++++++++----------------
>  drivers/pci/setup-res.c |  14 ++++-
>  2 files changed, 101 insertions(+), 46 deletions(-)
>
> diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> index 219a410..b98419e 100644
> --- a/drivers/pci/setup-bus.c
> +++ b/drivers/pci/setup-bus.c
> @@ -711,12 +711,11 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
>     bus resource of a given type. Note: we intentionally skip
>     the bus resources which have already been assigned (that is,
>     have non-NULL parent resource). */
> -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
> +static struct resource *find_free_bus_resource(struct pci_bus *bus,
> +                        unsigned long type_mask, unsigned long type)
>  {
>         int i;
>         struct resource *r;
> -       unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
>
>         pci_bus_for_each_resource(bus, r, i) {
>                 if (r == &ioport_resource || r == &iomem_resource)
> @@ -813,7 +812,8 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
>                 resource_size_t add_size, struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
> -       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> +       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
> +                                                       IORESOURCE_IO);
>         resource_size_t size = 0, size0 = 0, size1 = 0;
>         resource_size_t children_add_size = 0;
>         resource_size_t min_align, align;
> @@ -913,15 +913,16 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
>   * guarantees that all child resources fit in this size.
>   */
>  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> -                        unsigned long type, resource_size_t min_size,
> -                       resource_size_t add_size,
> -                       struct list_head *realloc_head)
> +                        unsigned long type, unsigned long type2,
> +                        resource_size_t min_size, resource_size_t add_size,
> +                        struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
>         resource_size_t min_align, align, size, size0, size1;
>         resource_size_t aligns[12];     /* Alignments from 1Mb to 2Gb */
>         int order, max_order;
> -       struct resource *b_res = find_free_bus_resource(bus, type);
> +       struct resource *b_res = find_free_bus_resource(bus,
> +                                        mask | IORESOURCE_PREFETCH, type);
>         unsigned int mem64_mask = 0;
>         resource_size_t children_add_size = 0;
>
> @@ -942,7 +943,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>                         struct resource *r = &dev->resource[i];
>                         resource_size_t r_size;
>
> -                       if (r->parent || (r->flags & mask) != type)
> +                       if (r->parent || ((r->flags & mask) != type &&
> +                                         (r->flags & mask) != type2))
>                                 continue;
>                         r_size = resource_size(r);
>  #ifdef CONFIG_PCI_IOV
> @@ -1115,8 +1117,9 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
>                         struct list_head *realloc_head)
>  {
>         struct pci_dev *dev;
> -       unsigned long mask, prefmask;
> +       unsigned long mask, prefmask, type2 = 0;
>         resource_size_t additional_mem_size = 0, additional_io_size = 0;
> +       struct resource *b_res;
>
>         list_for_each_entry(dev, &bus->devices, bus_list) {
>                 struct pci_bus *b = dev->subordinate;
> @@ -1161,15 +1164,31 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
>                    has already been allocated by arch code, try
>                    non-prefetchable range for both types of PCI memory
>                    resources. */
> +               b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
>                 mask = IORESOURCE_MEM;
>                 prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
> -               if (pbus_size_mem(bus, prefmask, prefmask,
> +               if (b_res[2].flags & IORESOURCE_MEM_64) {
> +                       prefmask |= IORESOURCE_MEM_64;
> +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
>                                   realloc_head ? 0 : additional_mem_size,
> -                                 additional_mem_size, realloc_head))
> -                       mask = prefmask; /* Success, size non-prefetch only. */
> -               else
> -                       additional_mem_size += additional_mem_size;
> -               pbus_size_mem(bus, mask, IORESOURCE_MEM,
> +                                 additional_mem_size, realloc_head)) {
> +                                       /* Success, size non-pref64 only. */
> +                                       mask = prefmask;
> +                                       type2 = prefmask & ~IORESOURCE_MEM_64;
> +                       }
> +               }
> +               if (!type2) {
> +                       prefmask &= ~IORESOURCE_MEM_64;
> +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
> +                                        realloc_head ? 0 : additional_mem_size,
> +                                        additional_mem_size, realloc_head)) {
> +                               /* Success, size non-prefetch only. */
> +                               mask = prefmask;
> +                       } else
> +                               additional_mem_size += additional_mem_size;
> +                       type2 = IORESOURCE_MEM;
> +               }
> +               pbus_size_mem(bus, mask, IORESOURCE_MEM, type2,
>                                 realloc_head ? 0 : additional_mem_size,
>                                 additional_mem_size, realloc_head);
>                 break;
> @@ -1255,42 +1274,66 @@ static void __ref __pci_bridge_assign_resources(const struct pci_dev *bridge,
>  static void pci_bridge_release_resources(struct pci_bus *bus,
>                                           unsigned long type)
>  {
> -       int idx;
> -       bool changed = false;
> -       struct pci_dev *dev;
> +       struct pci_dev *dev = bus->self;
>         struct resource *r;
>         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
> +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> +       unsigned old_flags = 0;
> +       struct resource *b_res;
> +       int idx = 1;
>
> -       dev = bus->self;
> -       for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
> -            idx++) {
> -               r = &dev->resource[idx];
> -               if ((r->flags & type_mask) != type)
> -                       continue;
> -               if (!r->parent)
> -                       continue;
> -               /*
> -                * if there are children under that, we should release them
> -                *  all
> -                */
> -               release_child_resources(r);
> -               if (!release_resource(r)) {
> -                       dev_printk(KERN_DEBUG, &dev->dev,
> -                                "resource %d %pR released\n", idx, r);
> -                       /* keep the old size */
> -                       r->end = resource_size(r) - 1;
> -                       r->start = 0;
> -                       r->flags = 0;
> -                       changed = true;
> -               }
> -       }
> +       b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
> +
> +       /*
> +        *     1. if there is io port assign fail, will release bridge
> +        *        io port.
> +        *     2. if there is non pref mmio assign fail, release bridge
> +        *        nonpref mmio.
> +        *     3. if there is 64bit pref mmio assign fail, and bridge pref
> +        *        is 64bit, release bridge pref mmio.
> +        *     4. if there is pref mmio assign fail, and bridge pref is
> +        *        32bit mmio, release bridge pref mmio
> +        *     5. if there is pref mmio assign fail, and bridge pref is not
> +        *        assigned, release bridge nonpref mmio.
> +        */
> +       if (type & IORESOURCE_IO)
> +               idx = 0;
> +       else if (!(type & IORESOURCE_PREFETCH))
> +               idx = 1;
> +       else if ((type & IORESOURCE_MEM_64) &&
> +                (b_res[2].flags & IORESOURCE_MEM_64))
> +               idx = 2;
> +       else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
> +                (b_res[2].flags & IORESOURCE_PREFETCH))
> +               idx = 2;
> +       else
> +               idx = 1;
> +
> +       r = &b_res[idx];
> +
> +       if (!r->parent)
> +               return;
> +
> +       /*
> +        * if there are children under that, we should release them
> +        *  all
> +        */
> +       release_child_resources(r);
> +       if (!release_resource(r)) {
> +               type = old_flags = r->flags & type_mask;
> +               dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
> +                                       PCI_BRIDGE_RESOURCES + idx, r);
> +               /* keep the old size */
> +               r->end = resource_size(r) - 1;
> +               r->start = 0;
> +               r->flags = 0;
>
> -       if (changed) {
>                 /* avoiding touch the one without PREF */
>                 if (type & IORESOURCE_PREFETCH)
>                         type = IORESOURCE_PREFETCH;
>                 __pci_setup_bridge(bus, type);
> +               /* for next child res under same bridge */
> +               r->flags = old_flags;
>         }
>  }
>
> @@ -1469,7 +1512,7 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
>         LIST_HEAD(fail_head);
>         struct pci_dev_resource *fail_res;
>         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> -                                 IORESOURCE_PREFETCH;
> +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
>         int pci_try_num = 1;
>         enum enable_type enable_local;
>
> diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> index 83c4d3b..e968412 100644
> --- a/drivers/pci/setup-res.c
> +++ b/drivers/pci/setup-res.c
> @@ -208,9 +208,21 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
>
>         /* First, try exact prefetching match.. */
>         ret = pci_bus_alloc_resource(bus, res, size, align, min,
> -                                    IORESOURCE_PREFETCH,
> +                                    IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
>                                      pcibios_align_resource, dev);
>
> +       if (ret < 0 &&
> +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
> +               /*
> +                * That failed.
> +                *
> +                * Try below 4g pref
> +                */
> +               ret = pci_bus_alloc_resource(bus, res, size, align, min,
> +                                            IORESOURCE_PREFETCH,
> +                                            pcibios_align_resource, dev);
> +       }
> +
>         if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
>                 /*
>                  * That failed.
> --
> 1.8.1.4
>
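
For readers following the pci_bridge_release_resources() hunk quoted above,
here is a minimal standalone sketch of how the new code picks the bridge
window to release. It is not the kernel code: the RES_* bits and the
bridge_release_idx() helper are hypothetical stand-ins for the IORESOURCE_*
flags and the if/else chain in the patch. The return value is the offset from
PCI_BRIDGE_RESOURCES (0 = io, 1 = non-pref mmio, 2 = pref mmio).

```c
/* Hypothetical stand-ins for the kernel's IORESOURCE_* flag bits. */
#define RES_IO       0x1UL
#define RES_MEM      0x2UL
#define RES_PREFETCH 0x4UL
#define RES_MEM_64   0x8UL

/*
 * Pick which bridge window to release.
 *   type       - flags of the child resource that failed to assign
 *   pref_flags - flags of the bridge prefetchable window (b_res[2])
 * Returns 0 (io), 1 (non-pref mmio) or 2 (pref mmio).
 */
static int bridge_release_idx(unsigned long type, unsigned long pref_flags)
{
	if (type & RES_IO)
		return 0;			/* case 1 */
	if (!(type & RES_PREFETCH))
		return 1;			/* case 2 */
	if ((type & RES_MEM_64) && (pref_flags & RES_MEM_64))
		return 2;			/* case 3 */
	if (!(pref_flags & RES_MEM_64) && (pref_flags & RES_PREFETCH))
		return 2;			/* case 4 */
	return 1;				/* case 5 and everything else */
}
```

Note that in this sketch a failed 32-bit pref child under a bridge whose pref
window is 64-bit falls through to the last branch, so the non-pref window is
the one released. That appears consistent with the changelog, which places
32-bit pref children in non-pref mmio when the bridge pref window is 64-bit.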


* Re: [PATCH v2 09/10] PCI: Sort pci root bus resources list
  2013-11-26  1:28 ` [PATCH v2 09/10] PCI: Sort pci root bus resources list Yinghai Lu
@ 2013-11-26  4:18   ` Bjorn Helgaas
  0 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26  4:18 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> Some x86 systems expose above 4G 64bit mmio in _CRS as a non-pref mmio range:
> [   49.415281] PCI host bridge to bus 0000:00
> [   49.419921] pci_bus 0000:00: root bus resource [bus 00-1e]
> [   49.426107] pci_bus 0000:00: root bus resource [io  0x0000-0x0cf7]
> [   49.433041] pci_bus 0000:00: root bus resource [io  0x1000-0x5fff]
> [   49.440010] pci_bus 0000:00: root bus resource [mem 0x000a0000-0x000bffff]
> [   49.447768] pci_bus 0000:00: root bus resource [mem 0xfed8c000-0xfedfffff]
> [   49.455532] pci_bus 0000:00: root bus resource [mem 0x90000000-0x9fffbfff]
> [   49.463259] pci_bus 0000:00: root bus resource [mem 0x380000000000-0x381fffffffff]
>
> When assigning an unassigned 64bit mmio resource, pci_bus_alloc_resource()
> goes through every non-pref mmio range of the root bus.
> Because that loop uses pci_bus_for_each_resource(), it may pick an under 4G
> mmio range instead of the above 4G one if the requested size is small enough
> to fit there, even though the resource could handle above 4G 64bit pref mmio.
>
> For the root bus, we can order the list from high to low in
> pci_add_resource_offset(); when the root bus is created, the final bus
> resource list keeps the same order.
>         pci_acpi_scan_root
>                 ==> add_resources
>                         ==> pci_add_resource_offset: # Add to temp resources
>                 ==> pci_create_root_bus
>                         ==> pci_bus_add_resource # add to final bus resources.
>
> After that, 64bit pref mmio for pci bridges will be allocated from the
> highest non-pref mmio range, which in this case is above 4G instead of under 4G.

Sorry I'm so slow; I'd like to know what problem this solves, too.
I'm trying to help people at distros figure out whether they will need
to backport this change.

> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  drivers/pci/bus.c | 15 +++++++++++++--
>  1 file changed, 13 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
> index f801f6a..adf17858 100644
> --- a/drivers/pci/bus.c
> +++ b/drivers/pci/bus.c
> @@ -21,7 +21,8 @@
>  void pci_add_resource_offset(struct list_head *resources, struct resource *res,
>                              resource_size_t offset)
>  {
> -       struct pci_host_bridge_window *window;
> +       struct pci_host_bridge_window *window, *tmp;
> +       struct list_head *n;
>
>         window = kzalloc(sizeof(struct pci_host_bridge_window), GFP_KERNEL);
>         if (!window) {
> @@ -31,7 +32,17 @@ void pci_add_resource_offset(struct list_head *resources, struct resource *res,
>
>         window->res = res;
>         window->offset = offset;
> -       list_add_tail(&window->list, resources);
> +
> +       /* sorted it according to res end */
> +       n = resources;
> +       list_for_each_entry(tmp, resources, list)
> +               if (window->res->end > tmp->res->end) {
> +                       n = &tmp->list;
> +                       break;
> +               }
> +
> +       /* Insert it just before n */
> +       list_add_tail(&window->list, n);
>  }
>  EXPORT_SYMBOL(pci_add_resource_offset);
>
> --
> 1.8.1.4
>
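
To make the ordering in the quoted hunk concrete, the following is a small
userspace sketch of the same insertion rule: walk the list, stop at the first
entry whose end is lower than the new one, and link the new entry just before
it. The struct window type and both helpers are hypothetical simplifications
of struct pci_host_bridge_window and the kernel list API, not the real
definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct pci_host_bridge_window. */
struct window {
	unsigned long long start, end;
	struct window *prev, *next;	/* circular list, head is a dummy node */
};

static void window_list_init(struct window *head)
{
	head->prev = head->next = head;
}

/* Insert "w" so that the list stays ordered from highest end to lowest. */
static void window_add_sorted(struct window *head, struct window *w)
{
	struct window *pos = head;	/* default: append at the tail */
	struct window *tmp;

	for (tmp = head->next; tmp != head; tmp = tmp->next) {
		if (w->end > tmp->end) {
			pos = tmp;	/* first entry that ends lower than w */
			break;
		}
	}

	/* Link w just before pos, like list_add_tail(&w->list, pos). */
	w->prev = pos->prev;
	w->next = pos;
	pos->prev->next = w;
	pos->prev = w;
}
```

Fed the four mem ranges from the dmesg excerpt above in their original order,
this sketch leaves the 0x380000000000 range at the front of the list, so a
front-to-back walk reaches the above 4G window first.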


* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26  4:17   ` Bjorn Helgaas
@ 2013-11-26  6:59     ` Guo Chao
  2013-11-26 17:53       ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Guo Chao @ 2013-11-26  6:59 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, linux-pci, linux-kernel

Hi, Bjorn:

On Mon, Nov 25, 2013 at 09:17:11PM -0700, Bjorn Helgaas wrote:
> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> > When one of children resources does not support MEM_64, MEM_64 for
> > bridge get reset, so pull down whole pref resource on the bridge under 4G.
> >
> > If the bridge support pref mem 64, will only allocate that with pref mem64 to
> > children that support it.
> > For children resources if they only support pref mem 32, will allocate them
> > from non pref mem instead.
> >
> > If the bridge only support 32bit pref mmio, will still have all children pref
> > mmio under that.
> 
> I can't figure out if this is supposed to fix a problem, and if so,
> what problem it is.  Can you include a URL for a bugzilla or other
> problem description?
>

This is intended to fix a resource allocation problem when we expose a
64-bit MMIO window on the PowerNV platform. Please see issue 3 in:

http://www.spinics.net/lists/linux-pci/msg26472.html

Without this, any 32-bit prefetchable BAR will pull down the
prefetchable window so that it is allocated from the 32-bit non-prefetchable
range, preventing 64-bit MMIO from being used at all.

What's worse, on some machines the 32-bit range is too small to provide
fallback space for the prefetchable window, causing all prefetchable
BARs to fail to get an address.

64-bit MMIO on PowerNV is still pending (but definitely planned).
So unless someone else has complained, this does not seem to fix any
problem upstream.

Thanks,
Guo Chao

> > -v2: Add release bridge res support with bridge mem res for pref_mem children res.
> > -v3: refresh and make it can be applied early before for_each_dev_res patchset.
> >
> > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> > Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
> > ---
> >  drivers/pci/setup-bus.c | 133 ++++++++++++++++++++++++++++++++----------------
> >  drivers/pci/setup-res.c |  14 ++++-
> >  2 files changed, 101 insertions(+), 46 deletions(-)
> >
> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
> > index 219a410..b98419e 100644
> > --- a/drivers/pci/setup-bus.c
> > +++ b/drivers/pci/setup-bus.c
> > @@ -711,12 +711,11 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
> >     bus resource of a given type. Note: we intentionally skip
> >     the bus resources which have already been assigned (that is,
> >     have non-NULL parent resource). */
> > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
> > +static struct resource *find_free_bus_resource(struct pci_bus *bus,
> > +                        unsigned long type_mask, unsigned long type)
> >  {
> >         int i;
> >         struct resource *r;
> > -       unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> >
> >         pci_bus_for_each_resource(bus, r, i) {
> >                 if (r == &ioport_resource || r == &iomem_resource)
> > @@ -813,7 +812,8 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
> >                 resource_size_t add_size, struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> > -       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
> > +       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
> > +                                                       IORESOURCE_IO);
> >         resource_size_t size = 0, size0 = 0, size1 = 0;
> >         resource_size_t children_add_size = 0;
> >         resource_size_t min_align, align;
> > @@ -913,15 +913,16 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
> >   * guarantees that all child resources fit in this size.
> >   */
> >  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> > -                        unsigned long type, resource_size_t min_size,
> > -                       resource_size_t add_size,
> > -                       struct list_head *realloc_head)
> > +                        unsigned long type, unsigned long type2,
> > +                        resource_size_t min_size, resource_size_t add_size,
> > +                        struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> >         resource_size_t min_align, align, size, size0, size1;
> >         resource_size_t aligns[12];     /* Alignments from 1Mb to 2Gb */
> >         int order, max_order;
> > -       struct resource *b_res = find_free_bus_resource(bus, type);
> > +       struct resource *b_res = find_free_bus_resource(bus,
> > +                                        mask | IORESOURCE_PREFETCH, type);
> >         unsigned int mem64_mask = 0;
> >         resource_size_t children_add_size = 0;
> >
> > @@ -942,7 +943,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
> >                         struct resource *r = &dev->resource[i];
> >                         resource_size_t r_size;
> >
> > -                       if (r->parent || (r->flags & mask) != type)
> > +                       if (r->parent || ((r->flags & mask) != type &&
> > +                                         (r->flags & mask) != type2))
> >                                 continue;
> >                         r_size = resource_size(r);
> >  #ifdef CONFIG_PCI_IOV
> > @@ -1115,8 +1117,9 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
> >                         struct list_head *realloc_head)
> >  {
> >         struct pci_dev *dev;
> > -       unsigned long mask, prefmask;
> > +       unsigned long mask, prefmask, type2 = 0;
> >         resource_size_t additional_mem_size = 0, additional_io_size = 0;
> > +       struct resource *b_res;
> >
> >         list_for_each_entry(dev, &bus->devices, bus_list) {
> >                 struct pci_bus *b = dev->subordinate;
> > @@ -1161,15 +1164,31 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
> >                    has already been allocated by arch code, try
> >                    non-prefetchable range for both types of PCI memory
> >                    resources. */
> > +               b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
> >                 mask = IORESOURCE_MEM;
> >                 prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
> > -               if (pbus_size_mem(bus, prefmask, prefmask,
> > +               if (b_res[2].flags & IORESOURCE_MEM_64) {
> > +                       prefmask |= IORESOURCE_MEM_64;
> > +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
> >                                   realloc_head ? 0 : additional_mem_size,
> > -                                 additional_mem_size, realloc_head))
> > -                       mask = prefmask; /* Success, size non-prefetch only. */
> > -               else
> > -                       additional_mem_size += additional_mem_size;
> > -               pbus_size_mem(bus, mask, IORESOURCE_MEM,
> > +                                 additional_mem_size, realloc_head)) {
> > +                                       /* Success, size non-pref64 only. */
> > +                                       mask = prefmask;
> > +                                       type2 = prefmask & ~IORESOURCE_MEM_64;
> > +                       }
> > +               }
> > +               if (!type2) {
> > +                       prefmask &= ~IORESOURCE_MEM_64;
> > +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
> > +                                        realloc_head ? 0 : additional_mem_size,
> > +                                        additional_mem_size, realloc_head)) {
> > +                               /* Success, size non-prefetch only. */
> > +                               mask = prefmask;
> > +                       } else
> > +                               additional_mem_size += additional_mem_size;
> > +                       type2 = IORESOURCE_MEM;
> > +               }
> > +               pbus_size_mem(bus, mask, IORESOURCE_MEM, type2,
> >                                 realloc_head ? 0 : additional_mem_size,
> >                                 additional_mem_size, realloc_head);
> >                 break;
> > @@ -1255,42 +1274,66 @@ static void __ref __pci_bridge_assign_resources(const struct pci_dev *bridge,
> >  static void pci_bridge_release_resources(struct pci_bus *bus,
> >                                           unsigned long type)
> >  {
> > -       int idx;
> > -       bool changed = false;
> > -       struct pci_dev *dev;
> > +       struct pci_dev *dev = bus->self;
> >         struct resource *r;
> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> > +       unsigned old_flags = 0;
> > +       struct resource *b_res;
> > +       int idx = 1;
> >
> > -       dev = bus->self;
> > -       for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
> > -            idx++) {
> > -               r = &dev->resource[idx];
> > -               if ((r->flags & type_mask) != type)
> > -                       continue;
> > -               if (!r->parent)
> > -                       continue;
> > -               /*
> > -                * if there are children under that, we should release them
> > -                *  all
> > -                */
> > -               release_child_resources(r);
> > -               if (!release_resource(r)) {
> > -                       dev_printk(KERN_DEBUG, &dev->dev,
> > -                                "resource %d %pR released\n", idx, r);
> > -                       /* keep the old size */
> > -                       r->end = resource_size(r) - 1;
> > -                       r->start = 0;
> > -                       r->flags = 0;
> > -                       changed = true;
> > -               }
> > -       }
> > +       b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
> > +
> > +       /*
> > +        *     1. if there is io port assign fail, will release bridge
> > +        *        io port.
> > +        *     2. if there is non pref mmio assign fail, release bridge
> > +        *        nonpref mmio.
> > +        *     3. if there is 64bit pref mmio assign fail, and bridge pref
> > +        *        is 64bit, release bridge pref mmio.
> > +        *     4. if there is pref mmio assign fail, and bridge pref is
> > +        *        32bit mmio, release bridge pref mmio
> > +        *     5. if there is pref mmio assign fail, and bridge pref is not
> > +        *        assigned, release bridge nonpref mmio.
> > +        */
> > +       if (type & IORESOURCE_IO)
> > +               idx = 0;
> > +       else if (!(type & IORESOURCE_PREFETCH))
> > +               idx = 1;
> > +       else if ((type & IORESOURCE_MEM_64) &&
> > +                (b_res[2].flags & IORESOURCE_MEM_64))
> > +               idx = 2;
> > +       else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
> > +                (b_res[2].flags & IORESOURCE_PREFETCH))
> > +               idx = 2;
> > +       else
> > +               idx = 1;
> > +
> > +       r = &b_res[idx];
> > +
> > +       if (!r->parent)
> > +               return;
> > +
> > +       /*
> > +        * if there are children under that, we should release them
> > +        *  all
> > +        */
> > +       release_child_resources(r);
> > +       if (!release_resource(r)) {
> > +               type = old_flags = r->flags & type_mask;
> > +               dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
> > +                                       PCI_BRIDGE_RESOURCES + idx, r);
> > +               /* keep the old size */
> > +               r->end = resource_size(r) - 1;
> > +               r->start = 0;
> > +               r->flags = 0;
> >
> > -       if (changed) {
> >                 /* avoiding touch the one without PREF */
> >                 if (type & IORESOURCE_PREFETCH)
> >                         type = IORESOURCE_PREFETCH;
> >                 __pci_setup_bridge(bus, type);
> > +               /* for next child res under same bridge */
> > +               r->flags = old_flags;
> >         }
> >  }
> >
> > @@ -1469,7 +1512,7 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
> >         LIST_HEAD(fail_head);
> >         struct pci_dev_resource *fail_res;
> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
> > -                                 IORESOURCE_PREFETCH;
> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
> >         int pci_try_num = 1;
> >         enum enable_type enable_local;
> >
> > diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
> > index 83c4d3b..e968412 100644
> > --- a/drivers/pci/setup-res.c
> > +++ b/drivers/pci/setup-res.c
> > @@ -208,9 +208,21 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
> >
> >         /* First, try exact prefetching match.. */
> >         ret = pci_bus_alloc_resource(bus, res, size, align, min,
> > -                                    IORESOURCE_PREFETCH,
> > +                                    IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
> >                                      pcibios_align_resource, dev);
> >
> > +       if (ret < 0 &&
> > +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
> > +               /*
> > +                * That failed.
> > +                *
> > +                * Try below 4g pref
> > +                */
> > +               ret = pci_bus_alloc_resource(bus, res, size, align, min,
> > +                                            IORESOURCE_PREFETCH,
> > +                                            pcibios_align_resource, dev);
> > +       }
> > +
> >         if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
> >                 /*
> >                  * That failed.
> > --
> > 1.8.1.4
> >
> 
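
The allocation order added to __pci_assign_resource() in the quoted
setup-res.c hunk can be sketched in userspace as below. This is only a toy
model under stated assumptions: struct toy_window, toy_alloc() and
toy_assign() are hypothetical, windows are reduced to a flag word, an
"above 4G" marker and a free byte count, and the flag-matching rule (skip a
window when the masked flags differ) is a simplification of what
pci_bus_alloc_resource() does with its type_mask argument.

```c
#include <stddef.h>

/* Hypothetical stand-ins for IORESOURCE_PREFETCH / IORESOURCE_MEM_64. */
#define RES_PREFETCH 0x1UL
#define RES_MEM_64   0x2UL

struct toy_window {
	unsigned long flags;		/* RES_* bits of the window */
	int above_4g;			/* non-zero if the window lies above 4G */
	unsigned long long free;	/* bytes still available */
};

/* One pass: only windows whose masked flags equal the resource's match. */
static int toy_alloc(struct toy_window *win, size_t n, unsigned long res_flags,
		     unsigned long long size, unsigned long type_mask)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if ((res_flags ^ win[i].flags) & type_mask)
			continue;	/* flag mismatch under this mask */
		if (win[i].above_4g && !(res_flags & RES_MEM_64))
			continue;	/* 32-bit BAR cannot sit above 4G */
		if (win[i].free < size)
			continue;	/* not enough room left */
		win[i].free -= size;
		return (int)i;
	}
	return -1;
}

/* Strictest match first, then relax. Returns the window index or -1. */
static int toy_assign(struct toy_window *win, size_t n,
		      unsigned long res_flags, unsigned long long size)
{
	int ret;

	/* 1. exact match on both PREFETCH and MEM_64 */
	ret = toy_alloc(win, n, res_flags, size, RES_PREFETCH | RES_MEM_64);

	/* 2. match on PREFETCH only (same condition as the patch) */
	if (ret < 0 && (res_flags & (RES_PREFETCH | RES_MEM_64)))
		ret = toy_alloc(win, n, res_flags, size, RES_PREFETCH);

	/* 3. prefetchable resource may fall back to any mmio window */
	if (ret < 0 && (res_flags & RES_PREFETCH))
		ret = toy_alloc(win, n, res_flags, size, 0);

	return ret;
}
```

With one small non-pref window below 4G and one 64-bit pref window above 4G,
a 64-bit pref BAR lands in the high window at step 1, while a 32-bit pref BAR
only succeeds at step 3 and only while the low window still has room. The
second half of that behaviour is the "32-bit range is too small" situation
described in the message above.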



* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26  6:59     ` Guo Chao
@ 2013-11-26 17:53       ` Bjorn Helgaas
  2013-11-26 22:00         ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26 17:53 UTC (permalink / raw)
  To: Guo Chao; +Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 11:59 PM, Guo Chao <yan@linux.vnet.ibm.com> wrote:
> On Mon, Nov 25, 2013 at 09:17:11PM -0700, Bjorn Helgaas wrote:
>> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> > When one of children resources does not support MEM_64, MEM_64 for
>> > bridge get reset, so pull down whole pref resource on the bridge under 4G.
>> >
>> > If the bridge support pref mem 64, will only allocate that with pref mem64 to
>> > children that support it.
>> > For children resources if they only support pref mem 32, will allocate them
>> > from non pref mem instead.
>> >
>> > If the bridge only support 32bit pref mmio, will still have all children pref
>> > mmio under that.
>>
>> I can't figure out if this is supposed to fix a problem, and if so,
>> what problem it is.  Can you include a URL for a bugzilla or other
>> problem description?
>
> This is intended to fix a resource allocation problem when we expose a
> 64-bit MMIO window on the PowerNV platform. Please see issue 3 in:
>
> http://www.spinics.net/lists/linux-pci/msg26472.html
>
> Without this, any 32-bit prefetchable BAR will pull down the
> prefetchable window so that it is allocated from the 32-bit non-prefetchable
> range, preventing 64-bit MMIO from being used at all.
>
> What's worse, on some machines the 32-bit range is too small to provide
> fallback space for the prefetchable window, causing all prefetchable
> BARs to fail to get an address.
>
> 64-bit MMIO on PowerNV is still pending (but definitely planned).
> So unless someone else has complained, this does not seem to fix any
> problem upstream.

I don't mind fixing a problem even if it's for pending platforms.  But
I do need a concrete specific description of the problem, e.g., a
dmesg log and pointers to specific bridge windows or device BARs that
are not allocated correctly, and some explanation about what is
different with this patch.

I don't know what "MEM_64 for bridge get reset" means -- there are a
couple places that clear IORESOURCE_MEM_64, but they don't seem
relevant.

>> > -v2: Add release bridge res support with bridge mem res for pref_mem children res.
>> > -v3: refresh and make it can be applied early before for_each_dev_res patchset.
>> >
>> > Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>> > Tested-by: Guo Chao <yan@linux.vnet.ibm.com>
>> > ---
>> >  drivers/pci/setup-bus.c | 133 ++++++++++++++++++++++++++++++++----------------
>> >  drivers/pci/setup-res.c |  14 ++++-
>> >  2 files changed, 101 insertions(+), 46 deletions(-)
>> >
>> > diff --git a/drivers/pci/setup-bus.c b/drivers/pci/setup-bus.c
>> > index 219a410..b98419e 100644
>> > --- a/drivers/pci/setup-bus.c
>> > +++ b/drivers/pci/setup-bus.c
>> > @@ -711,12 +711,11 @@ static void pci_bridge_check_ranges(struct pci_bus *bus)
>> >     bus resource of a given type. Note: we intentionally skip
>> >     the bus resources which have already been assigned (that is,
>> >     have non-NULL parent resource). */
>> > -static struct resource *find_free_bus_resource(struct pci_bus *bus, unsigned long type)
>> > +static struct resource *find_free_bus_resource(struct pci_bus *bus,
>> > +                        unsigned long type_mask, unsigned long type)
>> >  {
>> >         int i;
>> >         struct resource *r;
>> > -       unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
>> > -                                 IORESOURCE_PREFETCH;
>> >
>> >         pci_bus_for_each_resource(bus, r, i) {
>> >                 if (r == &ioport_resource || r == &iomem_resource)
>> > @@ -813,7 +812,8 @@ static void pbus_size_io(struct pci_bus *bus, resource_size_t min_size,
>> >                 resource_size_t add_size, struct list_head *realloc_head)
>> >  {
>> >         struct pci_dev *dev;
>> > -       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO);
>> > +       struct resource *b_res = find_free_bus_resource(bus, IORESOURCE_IO,
>> > +                                                       IORESOURCE_IO);
>> >         resource_size_t size = 0, size0 = 0, size1 = 0;
>> >         resource_size_t children_add_size = 0;
>> >         resource_size_t min_align, align;
>> > @@ -913,15 +913,16 @@ static inline resource_size_t calculate_mem_align(resource_size_t *aligns,
>> >   * guarantees that all child resources fit in this size.
>> >   */
>> >  static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>> > -                        unsigned long type, resource_size_t min_size,
>> > -                       resource_size_t add_size,
>> > -                       struct list_head *realloc_head)
>> > +                        unsigned long type, unsigned long type2,
>> > +                        resource_size_t min_size, resource_size_t add_size,
>> > +                        struct list_head *realloc_head)
>> >  {
>> >         struct pci_dev *dev;
>> >         resource_size_t min_align, align, size, size0, size1;
>> >         resource_size_t aligns[12];     /* Alignments from 1Mb to 2Gb */
>> >         int order, max_order;
>> > -       struct resource *b_res = find_free_bus_resource(bus, type);
>> > +       struct resource *b_res = find_free_bus_resource(bus,
>> > +                                        mask | IORESOURCE_PREFETCH, type);
>> >         unsigned int mem64_mask = 0;
>> >         resource_size_t children_add_size = 0;
>> >
>> > @@ -942,7 +943,8 @@ static int pbus_size_mem(struct pci_bus *bus, unsigned long mask,
>> >                         struct resource *r = &dev->resource[i];
>> >                         resource_size_t r_size;
>> >
>> > -                       if (r->parent || (r->flags & mask) != type)
>> > +                       if (r->parent || ((r->flags & mask) != type &&
>> > +                                         (r->flags & mask) != type2))
>> >                                 continue;
>> >                         r_size = resource_size(r);
>> >  #ifdef CONFIG_PCI_IOV
>> > @@ -1115,8 +1117,9 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
>> >                         struct list_head *realloc_head)
>> >  {
>> >         struct pci_dev *dev;
>> > -       unsigned long mask, prefmask;
>> > +       unsigned long mask, prefmask, type2 = 0;
>> >         resource_size_t additional_mem_size = 0, additional_io_size = 0;
>> > +       struct resource *b_res;
>> >
>> >         list_for_each_entry(dev, &bus->devices, bus_list) {
>> >                 struct pci_bus *b = dev->subordinate;
>> > @@ -1161,15 +1164,31 @@ void __ref __pci_bus_size_bridges(struct pci_bus *bus,
>> >                    has already been allocated by arch code, try
>> >                    non-prefetchable range for both types of PCI memory
>> >                    resources. */
>> > +               b_res = &bus->self->resource[PCI_BRIDGE_RESOURCES];
>> >                 mask = IORESOURCE_MEM;
>> >                 prefmask = IORESOURCE_MEM | IORESOURCE_PREFETCH;
>> > -               if (pbus_size_mem(bus, prefmask, prefmask,
>> > +               if (b_res[2].flags & IORESOURCE_MEM_64) {
>> > +                       prefmask |= IORESOURCE_MEM_64;
>> > +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
>> >                                   realloc_head ? 0 : additional_mem_size,
>> > -                                 additional_mem_size, realloc_head))
>> > -                       mask = prefmask; /* Success, size non-prefetch only. */
>> > -               else
>> > -                       additional_mem_size += additional_mem_size;
>> > -               pbus_size_mem(bus, mask, IORESOURCE_MEM,
>> > +                                 additional_mem_size, realloc_head)) {
>> > +                                       /* Success, size non-pref64 only. */
>> > +                                       mask = prefmask;
>> > +                                       type2 = prefmask & ~IORESOURCE_MEM_64;
>> > +                       }
>> > +               }
>> > +               if (!type2) {
>> > +                       prefmask &= ~IORESOURCE_MEM_64;
>> > +                       if (pbus_size_mem(bus, prefmask, prefmask, prefmask,
>> > +                                        realloc_head ? 0 : additional_mem_size,
>> > +                                        additional_mem_size, realloc_head)) {
>> > +                               /* Success, size non-prefetch only. */
>> > +                               mask = prefmask;
>> > +                       } else
>> > +                               additional_mem_size += additional_mem_size;
>> > +                       type2 = IORESOURCE_MEM;
>> > +               }
>> > +               pbus_size_mem(bus, mask, IORESOURCE_MEM, type2,
>> >                                 realloc_head ? 0 : additional_mem_size,
>> >                                 additional_mem_size, realloc_head);
>> >                 break;
>> > @@ -1255,42 +1274,66 @@ static void __ref __pci_bridge_assign_resources(const struct pci_dev *bridge,
>> >  static void pci_bridge_release_resources(struct pci_bus *bus,
>> >                                           unsigned long type)
>> >  {
>> > -       int idx;
>> > -       bool changed = false;
>> > -       struct pci_dev *dev;
>> > +       struct pci_dev *dev = bus->self;
>> >         struct resource *r;
>> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
>> > -                                 IORESOURCE_PREFETCH;
>> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
>> > +       unsigned old_flags = 0;
>> > +       struct resource *b_res;
>> > +       int idx = 1;
>> >
>> > -       dev = bus->self;
>> > -       for (idx = PCI_BRIDGE_RESOURCES; idx <= PCI_BRIDGE_RESOURCE_END;
>> > -            idx++) {
>> > -               r = &dev->resource[idx];
>> > -               if ((r->flags & type_mask) != type)
>> > -                       continue;
>> > -               if (!r->parent)
>> > -                       continue;
>> > -               /*
>> > -                * if there are children under that, we should release them
>> > -                *  all
>> > -                */
>> > -               release_child_resources(r);
>> > -               if (!release_resource(r)) {
>> > -                       dev_printk(KERN_DEBUG, &dev->dev,
>> > -                                "resource %d %pR released\n", idx, r);
>> > -                       /* keep the old size */
>> > -                       r->end = resource_size(r) - 1;
>> > -                       r->start = 0;
>> > -                       r->flags = 0;
>> > -                       changed = true;
>> > -               }
>> > -       }
>> > +       b_res = &dev->resource[PCI_BRIDGE_RESOURCES];
>> > +
>> > +       /*
>> > +        *     1. if there is io port assign fail, will release bridge
>> > +        *        io port.
>> > +        *     2. if there is non pref mmio assign fail, release bridge
>> > +        *        nonpref mmio.
>> > +        *     3. if there is 64bit pref mmio assign fail, and bridge pref
>> > +        *        is 64bit, release bridge pref mmio.
>> > +        *     4. if there is pref mmio assign fail, and bridge pref is
>> > +        *        32bit mmio, release bridge pref mmio
>> > +        *     5. if there is pref mmio assign fail, and bridge pref is not
>> > +        *        assigned, release bridge nonpref mmio.
>> > +        */
>> > +       if (type & IORESOURCE_IO)
>> > +               idx = 0;
>> > +       else if (!(type & IORESOURCE_PREFETCH))
>> > +               idx = 1;
>> > +       else if ((type & IORESOURCE_MEM_64) &&
>> > +                (b_res[2].flags & IORESOURCE_MEM_64))
>> > +               idx = 2;
>> > +       else if (!(b_res[2].flags & IORESOURCE_MEM_64) &&
>> > +                (b_res[2].flags & IORESOURCE_PREFETCH))
>> > +               idx = 2;
>> > +       else
>> > +               idx = 1;
>> > +
>> > +       r = &b_res[idx];
>> > +
>> > +       if (!r->parent)
>> > +               return;
>> > +
>> > +       /*
>> > +        * if there are children under that, we should release them
>> > +        *  all
>> > +        */
>> > +       release_child_resources(r);
>> > +       if (!release_resource(r)) {
>> > +               type = old_flags = r->flags & type_mask;
>> > +               dev_printk(KERN_DEBUG, &dev->dev, "resource %d %pR released\n",
>> > +                                       PCI_BRIDGE_RESOURCES + idx, r);
>> > +               /* keep the old size */
>> > +               r->end = resource_size(r) - 1;
>> > +               r->start = 0;
>> > +               r->flags = 0;
>> >
>> > -       if (changed) {
>> >                 /* avoiding touch the one without PREF */
>> >                 if (type & IORESOURCE_PREFETCH)
>> >                         type = IORESOURCE_PREFETCH;
>> >                 __pci_setup_bridge(bus, type);
>> > +               /* for next child res under same bridge */
>> > +               r->flags = old_flags;
>> >         }
>> >  }
>> >
>> > @@ -1469,7 +1512,7 @@ void pci_assign_unassigned_root_bus_resources(struct pci_bus *bus)
>> >         LIST_HEAD(fail_head);
>> >         struct pci_dev_resource *fail_res;
>> >         unsigned long type_mask = IORESOURCE_IO | IORESOURCE_MEM |
>> > -                                 IORESOURCE_PREFETCH;
>> > +                                 IORESOURCE_PREFETCH | IORESOURCE_MEM_64;
>> >         int pci_try_num = 1;
>> >         enum enable_type enable_local;
>> >
>> > diff --git a/drivers/pci/setup-res.c b/drivers/pci/setup-res.c
>> > index 83c4d3b..e968412 100644
>> > --- a/drivers/pci/setup-res.c
>> > +++ b/drivers/pci/setup-res.c
>> > @@ -208,9 +208,21 @@ static int __pci_assign_resource(struct pci_bus *bus, struct pci_dev *dev,
>> >
>> >         /* First, try exact prefetching match.. */
>> >         ret = pci_bus_alloc_resource(bus, res, size, align, min,
>> > -                                    IORESOURCE_PREFETCH,
>> > +                                    IORESOURCE_PREFETCH | IORESOURCE_MEM_64,
>> >                                      pcibios_align_resource, dev);
>> >
>> > +       if (ret < 0 &&
>> > +           (res->flags & (IORESOURCE_PREFETCH | IORESOURCE_MEM_64))) {
>> > +               /*
>> > +                * That failed.
>> > +                *
>> > +                * Try below 4g pref
>> > +                */
>> > +               ret = pci_bus_alloc_resource(bus, res, size, align, min,
>> > +                                            IORESOURCE_PREFETCH,
>> > +                                            pcibios_align_resource, dev);
>> > +       }
>> > +
>> >         if (ret < 0 && (res->flags & IORESOURCE_PREFETCH)) {
>> >                 /*
>> >                  * That failed.
>> > --
>> > 1.8.1.4
>> >
>>
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26  3:38   ` Bjorn Helgaas
@ 2013-11-26 19:34     ` Yinghai Lu
  2013-11-26 20:13       ` Yinghai Lu
  2013-11-27  1:17       ` [PATCH v2 04/10] PCI: Destroy pci dev only once Rafael J. Wysocki
  0 siblings, 2 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26 19:34 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 7:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> Mutliple removing via /sys will call pci_destroy_dev two times.
>>
>> | When concurent removing pci devices which are in the same pci subtree
>> | via sysfs, such as:
>> | echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
>> | /sys/bus/pci/devices/0000\:1a\:01.0/remove
>> | (1a:01.0 device is downstream from the 10:00.0 bridge)
>> |
>> | the following warning will show:
>> | [ 1799.280918] ------------[ cut here ]------------
>> | [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
>> | [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
>> | [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
>> | [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
>> | [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
>> | [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
>> | [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
>> | [ 1800.842282] Call Trace:
>> | [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
>> | [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
>> | [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
>> | [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
>> | [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
>> | [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
>> | [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
>> | [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
>> | [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
>> | [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
>> | [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
>> | [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
>> | [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
>> | [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
>> | [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
>> | [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
>> | [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
>> | [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
>> |
>> | This issue is related to the bug 54411:
>> | https://bugzilla.kernel.org/show_bug.cgi?id=54411
>>
>> Add is_removed to record if pci_destroy_dev is called already.
>>
>> During second calling, still have extra dev ref hold via
>> device_schedule_call, so we are safe to check dev->is_removed.
>>
>> It fixs the problem In Gu's test.
>>
>> -v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
>>      refresh after patch of moving device_del from Rafael.
>>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>> ---
>>  drivers/pci/remove.c | 8 +++++---
>>  include/linux/pci.h  | 1 +
>>  2 files changed, 6 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
>> index f452148..b090cec 100644
>> --- a/drivers/pci/remove.c
>> +++ b/drivers/pci/remove.c
>> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
>>
>>  static void pci_destroy_dev(struct pci_dev *dev)
>>  {
>> -       device_del(&dev->dev);
>> -
>> -       put_device(&dev->dev);
>> +       if (!dev->is_removed) {
>> +               device_del(&dev->dev);
>> +               dev->is_removed = 1;
>
> As Rafael pointed out, this looks like a race.  What prevents two
> concurrent calls to pci_destroy_dev() from seeing "dev->is_removed ==
> 0" and both calling device_del() on the same device?

I don't think that is going to happen, as those two pci_destroy_dev()
calls are serialized when
 echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
     /sys/bus/pci/devices/0000\:1a\:01.0/remove
is run.

Yinghai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-11-26  3:46   ` Bjorn Helgaas
@ 2013-11-26 19:35     ` Yinghai Lu
  2013-12-11 18:48       ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26 19:35 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie

On Mon, Nov 25, 2013 at 7:46 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> That bar could be 64bit pref mem and above 4G.
>>
>> -v2: refresh to 3.13-rc1
>>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>> Cc: David Airlie <airlied@linux.ie>
>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>
> This looks OK to me.  Does it depend on any previous patches in this
> series?  If not, I think Dave should pick it up.

No, it does not depend on them.

The problem it fixes could be exposed after patches 5-9 get applied.

Yinghai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26 19:34     ` Yinghai Lu
@ 2013-11-26 20:13       ` Yinghai Lu
  2013-11-27  1:24         ` Rafael J. Wysocki
  2013-11-27  1:17       ` [PATCH v2 04/10] PCI: Destroy pci dev only once Rafael J. Wysocki
  1 sibling, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26 20:13 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Tue, Nov 26, 2013 at 11:34 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Nov 25, 2013 at 7:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>> Mutliple removing via /sys will call pci_destroy_dev two times.
>>>
>>> | When concurent removing pci devices which are in the same pci subtree
>>> | via sysfs, such as:
>>> | echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
>>> | /sys/bus/pci/devices/0000\:1a\:01.0/remove
>>> | (1a:01.0 device is downstream from the 10:00.0 bridge)
>>> |
>>> | the following warning will show:
>>> | [ 1799.280918] ------------[ cut here ]------------
>>> | [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
>>> | [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
>>> | [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
>>> | [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
>>> | [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
>>> | [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
>>> | [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
>>> | [ 1800.842282] Call Trace:
>>> | [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
>>> | [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
>>> | [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
>>> | [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
>>> | [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
>>> | [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
>>> | [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
>>> | [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
>>> | [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
>>> | [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
>>> | [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
>>> | [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
>>> | [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
>>> | [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
>>> | [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
>>> | [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
>>> | [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
>>> | [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
>>> |
>>> | This issue is related to the bug 54411:
>>> | https://bugzilla.kernel.org/show_bug.cgi?id=54411
>>>
>>> Add is_removed to record if pci_destroy_dev is called already.
>>>
>>> During second calling, still have extra dev ref hold via
>>> device_schedule_call, so we are safe to check dev->is_removed.
>>>
>>> It fixs the problem In Gu's test.
>>>
>>> -v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
>>>      refresh after patch of moving device_del from Rafael.
>>>
>>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>>> ---
>>>  drivers/pci/remove.c | 8 +++++---
>>>  include/linux/pci.h  | 1 +
>>>  2 files changed, 6 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
>>> index f452148..b090cec 100644
>>> --- a/drivers/pci/remove.c
>>> +++ b/drivers/pci/remove.c
>>> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
>>>
>>>  static void pci_destroy_dev(struct pci_dev *dev)
>>>  {
>>> -       device_del(&dev->dev);
>>> -
>>> -       put_device(&dev->dev);
>>> +       if (!dev->is_removed) {
>>> +               device_del(&dev->dev);
>>> +               dev->is_removed = 1;
>>
>> As Rafael pointed out, this looks like a race.  What prevents two
>> concurrent calls to pci_destroy_dev() from seeing "dev->is_removed ==
>> 0" and both calling device_del() on the same device?
>

Hope you are happy with this one:

-v3: use atomic operations to prevent the race that Rafael and Bjorn
     are concerned about.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 drivers/pci/probe.c  |    2 ++
 drivers/pci/remove.c |    8 +++++---
 include/linux/pci.h  |    1 +
 3 files changed, 8 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev

 static void pci_destroy_dev(struct pci_dev *dev)
 {
-    device_del(&dev->dev);
-
-    put_device(&dev->dev);
+    if (atomic_inc_and_test(&dev->removed_count)) {
+        device_del(&dev->dev);
+        put_device(&dev->dev);
+    } else
+        atomic_dec(&dev->removed_count);
 }

 void pci_remove_bus(struct pci_bus *bus)
Index: linux-2.6/include/linux/pci.h
===================================================================
--- linux-2.6.orig/include/linux/pci.h
+++ linux-2.6/include/linux/pci.h
@@ -316,6 +316,7 @@ struct pci_dev {
     struct resource resource[DEVICE_COUNT_RESOURCE]; /* I/O and memory regions + expansion ROMs */

     bool match_driver;        /* Skip attaching driver */
+    atomic_t    removed_count;    /* pci_destroy_dev is called */
     /* These fields are used by common fixups */
     unsigned int    transparent:1;    /* Subtractive decode PCI bridge */
     unsigned int    multifunction:1;/* Part of multi-function device */
Index: linux-2.6/drivers/pci/probe.c
===================================================================
--- linux-2.6.orig/drivers/pci/probe.c
+++ linux-2.6/drivers/pci/probe.c
@@ -1398,6 +1398,8 @@ void pci_device_add(struct pci_dev *dev,
     dev->match_driver = false;
     ret = device_add(&dev->dev);
     WARN_ON(ret < 0);
+
+    atomic_set(&dev->removed_count, -1);
 }

 struct pci_dev *__ref pci_scan_single_device(struct pci_bus *bus, int devfn)
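
For readers following the -v3 idea, here is a minimal user-space sketch of
the same "destroy only once" guard, assuming C11 <stdatomic.h> in place of
the kernel's atomic_t. struct fake_dev, inc_and_test(), fake_destroy() and
demo_double_remove() are hypothetical names for illustration only; the
counter starts at -1 as in the patch, so only the first increment reaches
zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; only the counter matters. */
struct fake_dev {
	atomic_int removed_count;	/* -1 means "never destroyed" */
	int destroy_calls;		/* how often the teardown really ran */
};

/* User-space analogue of atomic_inc_and_test(): true if result is 0. */
static bool inc_and_test(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1 == 0;
}

static void fake_destroy(struct fake_dev *dev)
{
	if (inc_and_test(&dev->removed_count))
		dev->destroy_calls++;	/* device_del() + put_device() in the patch */
	else
		atomic_fetch_sub(&dev->removed_count, 1);	/* undo our increment */
}

/* Two removal requests for the same device tear it down once. */
static int demo_double_remove(void)
{
	struct fake_dev dev = { .destroy_calls = 0 };

	atomic_init(&dev.removed_count, -1);
	fake_destroy(&dev);
	fake_destroy(&dev);
	/* Later callers leave the counter at 0, never back at -1. */
	assert(atomic_load(&dev.removed_count) == 0);
	return dev.destroy_calls;
}
```

Once the first caller has moved the counter from -1 to 0, every later
increment yields a value above zero, so the teardown branch cannot be
taken again. The sketch says nothing about whether the callers still hold
a valid reference to the device; that part is outside what it models.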

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first
  2013-11-26  4:15   ` Bjorn Helgaas
@ 2013-11-26 20:14     ` Yinghai Lu
  0 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26 20:14 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Mon, Nov 25, 2013 at 8:15 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>> Will fall back to below 4g if it can not find any above 4g.
>
> Does this fix a bug?  If so, please include a bugzilla or mailing list URL.
>
>> x86 32bit without X86_PAE support will have bottom set to 0, because
>> resource_size_t is 32bit.
>>
>> Also for 32bit with resource_size_t 64bit kernel on machine with pae support
>> we are safe because iomem_resource is limited to 32bit according to
>> x86_phys_bits.
>>
>> -v2: update bottom assigning to make it clear for non-pae support machine.
>> -v3: Bjorn's change:
>>         use MAX_RESOURCE instead of -1
>>         use start/end instead of bottom/max
>>         for all arch instead of just x86_64
>> -v4: updated after PCI_MAX_RESOURCE_32 change.
>> -v5: restore io handling to use PCI_MAX_RESOURCE_32 as limit.
>> -v6: checking pcibios_resource_to_bus return for every bus res, to decide it
>>         if we need to try high at first.
>>      It supports all arches instead of just x86_64.
>>
>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>> ---
>>  arch/x86/include/asm/pci.h |  1 -
>>  drivers/pci/bus.c          | 42 ++++++++++++++++++++++++++++++++++--------
>>  drivers/pci/pci.h          |  2 ++
>>  include/linux/pci.h        |  4 ----
>>  4 files changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
>> index 947b5c4..122c299 100644
>> --- a/arch/x86/include/asm/pci.h
>> +++ b/arch/x86/include/asm/pci.h
>> @@ -125,7 +125,6 @@ int setup_msi_irq(struct pci_dev *dev, struct msi_desc *msidesc,
>>
>>  /* generic pci stuff */
>>  #include <asm-generic/pci.h>
>> -#define PCIBIOS_MAX_MEM_32 0xffffffff
>>
>>  #ifdef CONFIG_NUMA
>>  /* Returns the node based on pci bus */
>> diff --git a/drivers/pci/bus.c b/drivers/pci/bus.c
>> index 1ffd95b..f801f6a 100644
>> --- a/drivers/pci/bus.c
>> +++ b/drivers/pci/bus.c
>> @@ -125,15 +125,13 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
>>  {
>>         int i, ret = -ENOMEM;
>>         struct resource *r;
>> -       resource_size_t max = -1;
>>
>>         type_mask |= IORESOURCE_IO | IORESOURCE_MEM;
>>
>> -       /* don't allocate too high if the pref mem doesn't support 64bit*/
>> -       if (!(res->flags & IORESOURCE_MEM_64))
>> -               max = PCIBIOS_MAX_MEM_32;
>> -
>>         pci_bus_for_each_resource(bus, r, i) {
>> +               resource_size_t start, end, middle;
>> +               struct pci_bus_region region;
>> +
>
> I think you're doing two things at once in this patch:
>
> 1) Fixing the problem that the IORESOURCE_MEM_64 constraint was being
> applied to CPU addresses, not bus addresses, and
>
> 2) Trying to allocate above 4G first.
>
> Please separate these into two patches.  The first thing is an obvious
> problem and should have little risk of breaking anything.  The second
> probably makes sense, but the allocation change could certainly break
> something and have to be reverted.  It would be good if we could save
> the first fix if that happened.

sure.

>
>>                 if (!r)
>>                         continue;
>>
>> @@ -147,14 +145,42 @@ pci_bus_alloc_resource(struct pci_bus *bus, struct resource *res,
>>                     !(res->flags & IORESOURCE_PREFETCH))
>>                         continue;
>>
>> +               start = 0;
>> +               end = MAX_RESOURCE;
>> +               /*
>> +                * don't allocate too high if the pref mem doesn't
>> +                * support 64bit, also if this is a 64-bit mem
>> +                * resource, try above 4GB first
>> +                */
>> +               __pcibios_resource_to_bus(bus, &region, r);
>> +               if (region.start <= PCI_MAX_ADDR_32 &&
>> +                   region.end > PCI_MAX_ADDR_32) {
>> +                       middle = pcibios_bus_addr_to_res(bus, res->flags,
>> +                                                     PCI_MAX_ADDR_32);
>> +                       if (res->flags & IORESOURCE_MEM_64)
>> +                               start = middle + 1;
>> +                       else
>> +                               end = middle;
>> +               } else if (region.start > PCI_MAX_ADDR_32 &&
>> +                          !(res->flags & IORESOURCE_MEM_64))
>> +                               continue;
>
> This is sort of ugly.  Can you make some sort of "pci_clip_resource()"
>  so this loop remains readable?  E.g., something like:
>
>   static pci_bus_region pci_mem_32 = { 0, 0xffffffff };
>   static pci_bus_region pci_mem_64 = { 0x100000000, 0xffffffffffffffff };
>
>   struct resource avail = *r;
>
>   if (res->flags & IORESOURCE_MEM_64)
>     pci_clip_resource(&avail, &pci_mem_64);
>   else
>     pci_clip_resource(&avail, &pci_mem_32);
>   if (!resource_size(&avail))
>     continue;
>

ok.

Thanks

Yinghai
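
The clipping helper suggested above could look roughly like the sketch
below. It is a user-space illustration only: struct addr_range,
clip_range(), mem_32 and mem_64 are hypothetical names, the ranges are
inclusive, and the CPU-to-bus address translation discussed in point 1 is
left out entirely.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical inclusive address range, standing in for a bus region. */
struct addr_range {
	uint64_t start;
	uint64_t end;
};

static const struct addr_range mem_32 = { 0, 0xffffffffULL };
static const struct addr_range mem_64 = { 0x100000000ULL, UINT64_MAX };

/*
 * Shrink *avail to its intersection with *limit.
 * Returns false (and leaves *avail untouched) if they do not overlap.
 */
static bool clip_range(struct addr_range *avail, const struct addr_range *limit)
{
	uint64_t start = avail->start > limit->start ? avail->start : limit->start;
	uint64_t end = avail->end < limit->end ? avail->end : limit->end;

	if (start > end)
		return false;

	avail->start = start;
	avail->end = end;
	return true;
}
```

With a window of 0xc0000000-0x1ffffffff, clipping against mem_64 leaves
0x100000000-0x1ffffffff and clipping against mem_32 leaves
0xc0000000-0xffffffff. A boolean return is used for the "nothing left"
case because an inclusive range cannot encode a size of zero.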

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26 17:53       ` Bjorn Helgaas
@ 2013-11-26 22:00         ` Yinghai Lu
  2013-11-26 22:01           ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-26 22:00 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Guo Chao, Rafael J. Wysocki, Gu Zheng, linux-pci, linux-kernel

On Tue, Nov 26, 2013 at 9:53 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> I don't know what "MEM_64 for bridge get reset" means -- there are a
> couple places that clear IORESOURCE_MEM_64, but they don't seem
> relevant.

While sizing bridge resources, we try to clear the MEM_64 bit in the
bridge's pref mmio resource flags if any child's pref mmio does not
support 64bit.

Yinghai
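
As a rough illustration of that rule (not the actual sizing code), the
sketch below drops the 64bit flag from a bridge pref window as soon as one
prefetchable child lacks it. F_PREFETCH, F_MEM_64 and
size_pref_window_flags() are hypothetical names, and the flag values are
local stand-ins rather than the kernel's IORESOURCE_* constants.

```c
#include <stddef.h>

/* Local stand-in flag bits, chosen only for this sketch. */
#define F_PREFETCH	0x1u
#define F_MEM_64	0x2u

/*
 * Return the flags the bridge pref window would keep after looking at
 * all children: MEM_64 survives only if every prefetchable child has it.
 */
static unsigned int size_pref_window_flags(unsigned int bridge_flags,
					   const unsigned int *child_flags,
					   size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (!(child_flags[i] & F_PREFETCH))
			continue;	/* non-pref children use another window */
		if (!(child_flags[i] & F_MEM_64))
			bridge_flags &= ~F_MEM_64;
	}
	return bridge_flags;
}
```

One 32bit-only prefetchable child is enough to turn a
F_PREFETCH | F_MEM_64 window into a plain F_PREFETCH one, which is the
"MEM_64 for bridge get reset" case being asked about.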

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26 22:00         ` Yinghai Lu
@ 2013-11-26 22:01           ` Bjorn Helgaas
  2013-11-27  0:33             ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-11-26 22:01 UTC (permalink / raw)
  To: Yinghai Lu; +Cc: Guo Chao, Rafael J. Wysocki, Gu Zheng, linux-pci, linux-kernel

On Tue, Nov 26, 2013 at 3:00 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Tue, Nov 26, 2013 at 9:53 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> I don't know what "MEM_64 for bridge get reset" means -- there are a
>> couple places that clear IORESOURCE_MEM_64, but they don't seem
>> relevant.
>
> during size bridge resource, we try to clear bridge mmio64 pref MEM_64
> bit in bridge resource flags.
> if one children pref mmio does not support 64bit pref mmio.

A function name?  Please?

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g
  2013-11-26 22:01           ` Bjorn Helgaas
@ 2013-11-27  0:33             ` Yinghai Lu
  0 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-27  0:33 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Guo Chao, Rafael J. Wysocki, Gu Zheng, linux-pci, linux-kernel

On Tue, Nov 26, 2013 at 2:01 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> during size bridge resource, we try to clear bridge mmio64 pref MEM_64
>> bit in bridge resource flags.
>> if one children pref mmio does not support 64bit pref mmio.
>
> A function name?  Please?

drivers/pci/setup-bus.c::pbus_size_mem()

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus
  2013-11-26  1:28 ` [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus Yinghai Lu
@ 2013-11-27  1:09   ` Rafael J. Wysocki
  0 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-27  1:09 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel

On Monday, November 25, 2013 05:28:01 PM Yinghai Lu wrote:
> To be consistent with change in
> | PCI: Move device_del() from pci_stop_dev() to pci_destroy_dev()
> 
> Use device_release_driver for root bus/hostbridge.
> 
> Also use device_unregister() in pci_remove_root_bus() instead of
> device_del/put_device, that will be corresponding device_register()
> for pci_create_root_bus for hostbridge.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>

Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

> ---
>  drivers/pci/remove.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index cc9337a..692f4c3 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -128,7 +128,7 @@ void pci_stop_root_bus(struct pci_bus *bus)
>  		pci_stop_bus_device(child);
>  
>  	/* stop the host bridge */
> -	device_del(&host_bridge->dev);
> +	device_release_driver(&host_bridge->dev);
>  }
>  
>  void pci_remove_root_bus(struct pci_bus *bus)
> @@ -147,5 +147,5 @@ void pci_remove_root_bus(struct pci_bus *bus)
>  	host_bridge->bus = NULL;
>  
>  	/* remove the host bridge */
> -	put_device(&host_bridge->dev);
> +	device_unregister(&host_bridge->dev);
>  }
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev
  2013-11-26  1:28 ` [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev Yinghai Lu
@ 2013-11-27  1:15   ` Rafael J. Wysocki
  2013-11-27  2:15     ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-27  1:15 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel

On Monday, November 25, 2013 05:28:03 PM Yinghai Lu wrote:
> We should not release resource in pci_destroy that is too early
> as there could be still other use hold reference.
> 
> release them or remove it from bus devices list at last
> in pci_release_dev instead.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> ---
>  drivers/pci/probe.c  | 21 +++++++++++++++++++--
>  drivers/pci/remove.c | 19 -------------------
>  2 files changed, 19 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index 173a9cf..12ec56c 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1154,6 +1154,18 @@ static void pci_release_capabilities(struct pci_dev *dev)
>  	pci_free_cap_save_buffers(dev);
>  }
>  
> +static void pci_free_resources(struct pci_dev *dev)
> +{
> +	int i;
> +
> +	pci_cleanup_rom(dev);
> +	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> +		struct resource *res = dev->resource + i;
> +		if (res->parent)
> +			release_resource(res);
> +	}
> +}
> +
>  /**
>   * pci_release_dev - free a pci device structure when all users of it are finished.
>   * @dev: device that's been disconnected
> @@ -1163,9 +1175,14 @@ static void pci_release_capabilities(struct pci_dev *dev)
>   */
>  static void pci_release_dev(struct device *dev)
>  {
> -	struct pci_dev *pci_dev;
> +	struct pci_dev *pci_dev = to_pci_dev(dev);
> +
> +	down_write(&pci_bus_sem);
> +	list_del(&pci_dev->bus_list);
> +	up_write(&pci_bus_sem);
> +
> +	pci_free_resources(pci_dev);

What are the possible side effects of this change?

> -	pci_dev = to_pci_dev(dev);
>  	pci_release_capabilities(pci_dev);
>  	pci_release_of_node(pci_dev);
>  	pcibios_release_device(pci_dev);
> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> index 692f4c3..f452148 100644
> --- a/drivers/pci/remove.c
> +++ b/drivers/pci/remove.c
> @@ -3,20 +3,6 @@
>  #include <linux/pci-aspm.h>
>  #include "pci.h"
>  
> -static void pci_free_resources(struct pci_dev *dev)
> -{
> -	int i;
> -
> -	msi_remove_pci_irq_vectors(dev);
> -
> -	pci_cleanup_rom(dev);
> -	for (i = 0; i < PCI_NUM_RESOURCES; i++) {
> -		struct resource *res = dev->resource + i;
> -		if (res->parent)
> -			release_resource(res);
> -	}
> -}
> -
>  static void pci_stop_dev(struct pci_dev *dev)
>  {
>  	pci_pme_active(dev, false);
> @@ -36,11 +22,6 @@ static void pci_destroy_dev(struct pci_dev *dev)
>  {
>  	device_del(&dev->dev);
>  
> -	down_write(&pci_bus_sem);
> -	list_del(&dev->bus_list);
> -	up_write(&pci_bus_sem);
> -
> -	pci_free_resources(dev);
>  	put_device(&dev->dev);

And if the side effects are benign enough, why don't we do a device_unregister()
here?

>  }
>  

Rafael


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26 19:34     ` Yinghai Lu
  2013-11-26 20:13       ` Yinghai Lu
@ 2013-11-27  1:17       ` Rafael J. Wysocki
  1 sibling, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-27  1:17 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel

On Tuesday, November 26, 2013 11:34:24 AM Yinghai Lu wrote:
> On Mon, Nov 25, 2013 at 7:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> > On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> >> Mutliple removing via /sys will call pci_destroy_dev two times.
> >>
> >> | When concurent removing pci devices which are in the same pci subtree
> >> | via sysfs, such as:
> >> | echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
> >> | /sys/bus/pci/devices/0000\:1a\:01.0/remove
> >> | (1a:01.0 device is downstream from the 10:00.0 bridge)
> >> |
> >> | the following warning will show:
> >> | [ 1799.280918] ------------[ cut here ]------------
> >> | [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
> >> | [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
> >> | [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
> >> | [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
> >> | [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
> >> | [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
> >> | [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
> >> | [ 1800.842282] Call Trace:
> >> | [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
> >> | [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
> >> | [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
> >> | [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
> >> | [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
> >> | [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
> >> | [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
> >> | [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
> >> | [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
> >> | [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
> >> | [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
> >> | [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
> >> | [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
> >> | [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
> >> | [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> >> | [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
> >> | [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> >> | [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
> >> |
> >> | This issue is related to the bug 54411:
> >> | https://bugzilla.kernel.org/show_bug.cgi?id=54411
> >>
> >> Add is_removed to record whether pci_destroy_dev() has already been called.
> >>
> >> During the second call, an extra dev reference is still held via
> >> device_schedule_callback(), so it is safe to check dev->is_removed.
> >>
> >> It fixes the problem in Gu's test.
> >>
> >> -v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
> >>      refresh after patch of moving device_del from Rafael.
> >>
> >> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> >> ---
> >>  drivers/pci/remove.c | 8 +++++---
> >>  include/linux/pci.h  | 1 +
> >>  2 files changed, 6 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> >> index f452148..b090cec 100644
> >> --- a/drivers/pci/remove.c
> >> +++ b/drivers/pci/remove.c
> >> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
> >>
> >>  static void pci_destroy_dev(struct pci_dev *dev)
> >>  {
> >> -       device_del(&dev->dev);
> >> -
> >> -       put_device(&dev->dev);
> >> +       if (!dev->is_removed) {
> >> +               device_del(&dev->dev);
> >> +               dev->is_removed = 1;
> >
> > As Rafael pointed out, this looks like a race.  What prevents two
> > concurrent calls to pci_destroy_dev() from seeing "dev->is_removed ==
> > 0" and both calling device_del() on the same device?
> 
> I don't think that is going to happen, as those two pci_destroy_dev() calls
> are serialized when
>  echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
>      /sys/bus/pci/devices/0000\:1a\:01.0/remove
> is called.

And what exactly does serialize that with the removals started via
trim_stale_devices() from acpiphp_check_bridge()?
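
To make the concern concrete, here is a minimal userspace sketch (not kernel
code) of the check-then-set window on a flag that nothing serializes.  The
names struct fake_dev, check_removed() and destroy_if_not_removed() are
hypothetical, and the two callers are interleaved by hand so that the outcome
is deterministic rather than timing-dependent.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct fake_dev {
	bool is_removed;	/* models the proposed dev->is_removed flag */
	int del_calls;		/* how many times "device_del()" ran */
};

/* Step 1 of the unserialized guard: read the flag. */
bool check_removed(const struct fake_dev *dev)
{
	return dev->is_removed;
}

/* Step 2: act on the (possibly stale) result of step 1. */
void destroy_if_not_removed(struct fake_dev *dev, bool seen_removed)
{
	if (!seen_removed) {
		dev->del_calls++;	/* stands in for device_del() */
		dev->is_removed = true;
	}
}

/*
 * Two callers interleaved by hand: both read the flag before either
 * writes it.  Returns the number of times the device was "deleted".
 */
int racy_interleaving(void)
{
	struct fake_dev dev = { .is_removed = false, .del_calls = 0 };
	bool a_seen = check_removed(&dev);	/* caller A reads 0 */
	bool b_seen = check_removed(&dev);	/* caller B reads 0 too */

	destroy_if_not_removed(&dev, a_seen);
	destroy_if_not_removed(&dev, b_seen);
	return dev.del_calls;
}

/* The same two callers, each running check and act back to back. */
int serialized_interleaving(void)
{
	struct fake_dev dev = { .is_removed = false, .del_calls = 0 };

	destroy_if_not_removed(&dev, check_removed(&dev));
	destroy_if_not_removed(&dev, check_removed(&dev));
	return dev.del_calls;
}
```

In this model the flag only helps when something outside of it keeps the
check and the set together, which is exactly what the question about the
acpiphp path is asking for.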

Rafael


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-26 20:13       ` Yinghai Lu
@ 2013-11-27  1:24         ` Rafael J. Wysocki
  2013-11-27  2:26           ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-27  1:24 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel

On Tuesday, November 26, 2013 12:13:50 PM Yinghai Lu wrote:
> On Tue, Nov 26, 2013 at 11:34 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> > On Mon, Nov 25, 2013 at 7:38 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> >> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> >>> Multiple removals via /sys will call pci_destroy_dev() twice.
> >>>
> >>> | When concurrently removing PCI devices which are in the same PCI subtree
> >>> | via sysfs, such as:
> >>> | echo -n 1 > /sys/bus/pci/devices/0000\:10\:00.0/remove ; echo -n 1 >
> >>> | /sys/bus/pci/devices/0000\:1a\:01.0/remove
> >>> | (1a:01.0 device is downstream from the 10:00.0 bridge)
> >>> |
> >>> | the following warning will show:
> >>> | [ 1799.280918] ------------[ cut here ]------------
> >>> | [ 1799.336199] WARNING: CPU: 7 PID: 126 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
> >>> | [ 1799.433093] list_del corruption, ffff8807b4a7c000->next is LIST_POISON1 (dead000000100100)
> >>> | [ 1800.276623] CPU: 7 PID: 126 Comm: kworker/u512:1 Tainted: G        W    3.12.0-rc5+ #196
> >>> | [ 1800.508918] Workqueue: sysfsd sysfs_schedule_callback_work
> >>> | [ 1800.574703]  0000000000000009 ffff8807adbadbd8 ffffffff8168b26c ffff8807c27d08a8
> >>> | [ 1800.663860]  ffff8807adbadc28 ffff8807adbadc18 ffffffff810711dc ffff8807adbadc68
> >>> | [ 1800.753130]  ffff8807b4a7c000 ffff8807b4a7c000 ffff8807ad089c00 0000000000000000
> >>> | [ 1800.842282] Call Trace:
> >>> | [ 1800.871651]  [<ffffffff8168b26c>] dump_stack+0x55/0x76
> >>> | [ 1800.933301]  [<ffffffff810711dc>] warn_slowpath_common+0x8c/0xc0
> >>> | [ 1801.005283]  [<ffffffff810712c6>] warn_slowpath_fmt+0x46/0x50
> >>> | [ 1801.074081]  [<ffffffff8135a343>] __list_del_entry+0x63/0xd0
> >>> | [ 1801.141839]  [<ffffffff8135a3c1>] list_del+0x11/0x40
> >>> | [ 1801.201320]  [<ffffffff813734da>] pci_remove_bus_device+0x6a/0xe0
> >>> | [ 1801.274279]  [<ffffffff8137356e>] pci_stop_and_remove_bus_device+0x1e/0x30
> >>> | [ 1801.356606]  [<ffffffff8137b20b>] remove_callback+0x2b/0x40
> >>> | [ 1801.423412]  [<ffffffff81251848>] sysfs_schedule_callback_work+0x18/0x60
> >>> | [ 1801.503744]  [<ffffffff8108eab5>] process_one_work+0x1f5/0x540
> >>> | [ 1801.573640]  [<ffffffff8108ea53>] ? process_one_work+0x193/0x540
> >>> | [ 1801.645616]  [<ffffffff8108f2ac>] worker_thread+0x11c/0x370
> >>> | [ 1801.712337]  [<ffffffff8108f190>] ? rescuer_thread+0x350/0x350
> >>> | [ 1801.782178]  [<ffffffff8109731d>] kthread+0xed/0x100
> >>> | [ 1801.841661]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> >>> | [ 1801.919919]  [<ffffffff8169cc3c>] ret_from_fork+0x7c/0xb0
> >>> | [ 1801.984608]  [<ffffffff81097230>] ? kthread_create_on_node+0x160/0x160
> >>> | [ 1802.062825] ---[ end trace d77f2054de000fb7 ]---
> >>> |
> >>> | This issue is related to the bug 54411:
> >>> | https://bugzilla.kernel.org/show_bug.cgi?id=54411
> >>>
> >>> Add is_removed to record whether pci_destroy_dev() has already been called.
> >>>
> >>> During the second call, an extra dev reference is still held via
> >>> device_schedule_callback(), so it is safe to check dev->is_removed.
> >>>
> >>> It fixes the problem in Gu's test.
> >>>
> >>> -v2: add partial changelog from Gu Zheng <guz.fnst@cn.fujitsu.com>
> >>>      refresh after patch of moving device_del from Rafael.
> >>>
> >>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> >>> ---
> >>>  drivers/pci/remove.c | 8 +++++---
> >>>  include/linux/pci.h  | 1 +
> >>>  2 files changed, 6 insertions(+), 3 deletions(-)
> >>>
> >>> diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
> >>> index f452148..b090cec 100644
> >>> --- a/drivers/pci/remove.c
> >>> +++ b/drivers/pci/remove.c
> >>> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev *dev)
> >>>
> >>>  static void pci_destroy_dev(struct pci_dev *dev)
> >>>  {
> >>> -       device_del(&dev->dev);
> >>> -
> >>> -       put_device(&dev->dev);
> >>> +       if (!dev->is_removed) {
> >>> +               device_del(&dev->dev);
> >>> +               dev->is_removed = 1;
> >>
> >> As Rafael pointed out, this looks like a race.  What prevents two
> >> concurrent calls to pci_destroy_dev() from seeing "dev->is_removed ==
> >> 0" and both calling device_del() on the same device?
> >
> 
> hope you are happy with this one:
> 
> -v3: use atomic operations to prevent the race that Rafael and Bjorn are concerned about.
> 
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> 
> ---
>  drivers/pci/probe.c  |    2 ++
>  drivers/pci/remove.c |    8 +++++---
>  include/linux/pci.h  |    1 +
>  3 files changed, 8 insertions(+), 3 deletions(-)
> 
> Index: linux-2.6/drivers/pci/remove.c
> ===================================================================
> --- linux-2.6.orig/drivers/pci/remove.c
> +++ linux-2.6/drivers/pci/remove.c
> @@ -20,9 +20,11 @@ static void pci_stop_dev(struct pci_dev
> 
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> -    device_del(&dev->dev);
> -
> -    put_device(&dev->dev);
> +    if (atomic_inc_and_test(&dev->removed_count)) {
> +        device_del(&dev->dev);
> +        put_device(&dev->dev);
> +    } else
> +        atomic_dec(&dev->removed_count);
>  }

So assume pci_destroy_dev() is called twice in parallel for the same dev
by two different threads.  Thread 1 does the atomic_inc_and_test() and
finds that it is OK to do the device_del() and put_device() which causes
the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
on the already freed device object and crashes the kernel.

I think we need to be much more clever here ...
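
A rough userspace model of that scenario, for illustration only.  It assumes
the counter starts at -1 (the probe.c hunk of v3 is not quoted here), since
atomic_inc_and_test() returns true only when the incremented value is zero.
struct fake_dev, fake_put() and fake_destroy() are hypothetical names, and a
"freed" flag is set instead of really freeing the object so that the late
access can be counted safely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of a refcounted device; not struct pci_dev. */
struct fake_dev {
	int refcount;			/* models the struct device refcount */
	atomic_int removed_count;	/* assumed to start at -1 */
	bool freed;			/* set instead of calling free() */
	int stale_accesses;		/* accesses made after the last put */
};

void fake_dev_init(struct fake_dev *dev, int refs)
{
	dev->refcount = refs;
	atomic_init(&dev->removed_count, -1);
	dev->freed = false;
	dev->stale_accesses = 0;
}

void fake_put(struct fake_dev *dev)
{
	if (--dev->refcount == 0)
		dev->freed = true;	/* the real code would free here */
}

/* Userspace analogue of atomic_inc_and_test(): true if the result is 0. */
bool inc_and_test(atomic_int *v)
{
	return atomic_fetch_add(v, 1) + 1 == 0;
}

/* Models the v3 pci_destroy_dev(); returns true if this caller destroyed. */
bool fake_destroy(struct fake_dev *dev)
{
	if (dev->freed)
		dev->stale_accesses++;	/* touching an object that is gone */

	if (inc_and_test(&dev->removed_count)) {
		fake_put(dev);		/* device_del() + put_device() */
		return true;
	}
	atomic_fetch_sub(&dev->removed_count, 1);
	return false;
}
```

With a single reference, the second fake_destroy() call in this model touches
the object after the first call dropped the last reference.  With a second
reference held by the caller it does not, so the atomic counter only protects
callers that pin the device themselves.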

Rafael


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev
  2013-11-27  1:15   ` Rafael J. Wysocki
@ 2013-11-27  2:15     ` Yinghai Lu
  0 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-11-27  2:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	Linux Kernel Mailing List

On Tue, Nov 26, 2013 at 5:15 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> @@ -36,11 +22,6 @@ static void pci_destroy_dev(struct pci_dev *dev)
>>  {
>>       device_del(&dev->dev);
>>
>> -     down_write(&pci_bus_sem);
>> -     list_del(&dev->bus_list);
>> -     up_write(&pci_bus_sem);
>> -
>> -     pci_free_resources(dev);
>>       put_device(&dev->dev);
>
> And if the side effects are benign enough, why don't we do a device_unregister()
> here?

Yes, that is the same, but we are using device_add() in pci_device_add()...
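
For reference, a simplified userspace model of the pairing being discussed:
device_register() is device_initialize() followed by device_add(), and
device_unregister() is device_del() followed by put_device().  The fake_*
names below are hypothetical, and the model ignores the extra references that
the real device_add() takes internally.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified model of struct device. */
struct fake_device {
	int refcount;
	bool added;	/* visible in the device hierarchy */
	bool released;	/* release callback has run */
};

void fake_device_initialize(struct fake_device *d)
{
	d->refcount = 1;
	d->added = false;
	d->released = false;
}

void fake_device_add(struct fake_device *d)
{
	d->added = true;
}

void fake_device_del(struct fake_device *d)
{
	d->added = false;
}

void fake_put_device(struct fake_device *d)
{
	if (--d->refcount == 0)
		d->released = true;
}

/* Combined helpers, mirroring how the driver core composes them. */
void fake_device_register(struct fake_device *d)
{
	fake_device_initialize(d);
	fake_device_add(d);
}

void fake_device_unregister(struct fake_device *d)
{
	fake_device_del(d);
	fake_put_device(d);
}
```

Both teardown spellings end in the same state in this model.  The open-coded
device_del() plus put_device() simply matches the open-coded
device_initialize() plus device_add() on the setup side.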

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-27  1:24         ` Rafael J. Wysocki
@ 2013-11-27  2:26           ` Yinghai Lu
  2013-11-29 23:38             ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-27  2:26 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel

On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> So assume pci_destroy_dev() is called twice in parallel for the same dev
> by two different threads.  Thread 1 does the atomic_inc_and_test() and
> finds that it is OK to do the device_del() and put_device() which causes
> the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> on the already freed device object and crashes the kernel.
>
Thread 2 should still hold one extra reference.
It is taken in
  device_schedule_callback
     ==> sysfs_schedule_callback
         ==> kobject_get(kobj)

pci_destroy_dev() for thread 2 is called at this point.

That reference will be released from
        sysfs_schedule_callback
        ==> kobject_put()...
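
The reference flow described above can be sketched in userspace roughly as
follows.  All names (struct kobj_model, schedule_callback_model(),
run_work_model(), remove_callback_model()) are hypothetical stand-ins, and
the work item is run directly instead of going through a workqueue.

```c
#include <stdbool.h>

/* Hypothetical model of a refcounted kobject. */
struct kobj_model {
	int refcount;
	bool released;			/* last reference dropped */
	bool released_in_callback;	/* was it gone while the callback ran? */
};

void kobj_model_get(struct kobj_model *k)
{
	k->refcount++;
}

void kobj_model_put(struct kobj_model *k)
{
	if (--k->refcount == 0)
		k->released = true;
}

/* Hypothetical deferred work item carrying the pinned object. */
struct work_model {
	struct kobj_model *kobj;
	void (*func)(struct kobj_model *);
};

/* Scheduling takes a reference so the object outlives the callback. */
void schedule_callback_model(struct work_model *w, struct kobj_model *k,
			     void (*func)(struct kobj_model *))
{
	kobj_model_get(k);
	w->kobj = k;
	w->func = func;
}

/* Running the work calls the callback, then drops the scheduling ref. */
void run_work_model(struct work_model *w)
{
	w->func(w->kobj);
	kobj_model_put(w->kobj);
}

/* Models the removal path dropping the device's initial reference. */
void remove_callback_model(struct kobj_model *k)
{
	kobj_model_put(k);
	k->released_in_callback = k->released;
}
```

In this model the object is still alive for the whole callback and is only
released by the final put after it returns.  That guarantee covers only
removals that were started through this scheduling path.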

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-27  2:26           ` Yinghai Lu
@ 2013-11-29 23:38             ` Rafael J. Wysocki
  2013-11-29 23:45               ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-29 23:38 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > finds that it is OK to do the device_del() and put_device() which causes
> > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > on the already freed device object and crashes the kernel.
> >
> thread2 should still hold one extra reference.
> that is in
>   device_schedule_callback
>      ==> sysfs_schedule_callback
>          ==> kobject_get(kobj)
> 
> pci_destroy_dev for thread2 is called at this point.
> 
> and that reference will be released from
>         sysfs_schedule_callback
>         ==> kobject_put()...

Well, that would be the case if thread 2 was started by device_schedule_callback(),
but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
that doesn't hold extra references to the pci_dev.  [Well, that piece of code
is racy anyway, because it walks bus->devices without locking.  Which is my
fault too, because I overlooked that.  Shame, shame.]

Perhaps we can do something like the (untested) patch below (in addition to the
$subject patch).  Do you see any immediate problems with it?

Also I wonder if it is safe to do pci_stop_and_remove_bus_device() in acpiphp_glue.c
without holding pci_remove_rescan_mutex?
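
The idea of the patch, reduced to a userspace sketch: select the stale
devices while the bus list is locked, pin each one with a reference, and do
the actual removal only after the lock has been dropped.  Everything here is
a hypothetical model: the lock is a plain flag, and the devices are copied
into an array, whereas the patch moves them onto a private list with
list_move().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_DEVS 8

/* Hypothetical device model; not struct pci_dev. */
struct fake_dev {
	bool alive;	/* does the device still respond? */
	int refcount;
	bool removed;
};

/* Hypothetical bus model; "locked" stands in for pci_bus_sem. */
struct fake_bus {
	bool locked;
	struct fake_dev *devs[MAX_DEVS];
	size_t ndevs;
};

/* Phase 1: under the lock, pick stale devices and pin each one. */
size_t find_stale(struct fake_bus *bus, struct fake_dev **trim, size_t cap)
{
	size_t n = 0;

	bus->locked = true;
	for (size_t i = 0; i < bus->ndevs && n < cap; i++) {
		struct fake_dev *dev = bus->devs[i];

		if (!dev->alive) {
			dev->refcount++;	/* models pci_dev_get() */
			trim[n++] = dev;
		}
	}
	bus->locked = false;
	return n;
}

/* Phase 2: lock dropped, remove each pinned device and unpin it. */
void trim_stale(struct fake_bus *bus, struct fake_dev **trim, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		assert(!bus->locked);	 /* removal may take the lock itself */
		trim[i]->removed = true; /* models the stop-and-remove call */
		trim[i]->refcount--;	 /* models pci_dev_put() */
	}
}
```

The extra reference is what keeps each selected device valid between the two
phases in this model, even if some other path destroys it in the meantime.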

Rafael


---
 drivers/pci/hotplug/acpiphp_glue.c |   47 +++++++++++++++++++++++++------------
 1 file changed, 33 insertions(+), 14 deletions(-)

Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -687,10 +687,11 @@ static unsigned int get_slot_status(stru
 }
 
 /**
- * trim_stale_devices - remove PCI devices that are not responding.
+ * find_stale_devices - Select PCI devices that are not responding for removal.
  * @dev: PCI device to start walking the hierarchy from.
+ * @trim_list: List of devices to remove.
  */
-static void trim_stale_devices(struct pci_dev *dev)
+static void find_stale_devices(struct pci_dev *dev, struct list_head *trim_list)
 {
 	acpi_handle handle = ACPI_HANDLE(&dev->dev);
 	struct pci_bus *bus = dev->subordinate;
@@ -710,21 +711,46 @@ static void trim_stale_devices(struct pc
 		alive = pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &v, 0);
 	}
 	if (!alive) {
-		pci_stop_and_remove_bus_device(dev);
-		if (handle)
-			acpiphp_bus_trim(handle);
+		pci_dev_get(dev);
+		list_move(&dev->bus_list, trim_list);
 	} else if (bus) {
 		struct pci_dev *child, *tmp;
 
 		/* The device is a bridge. so check the bus below it. */
 		pm_runtime_get_sync(&dev->dev);
 		list_for_each_entry_safe(child, tmp, &bus->devices, bus_list)
-			trim_stale_devices(child);
+			find_stale_devices(child, trim_list);
 
 		pm_runtime_put(&dev->dev);
 	}
 }
 
+void acpiphp_trim_stale_devices(struct acpiphp_slot *slot)
+{
+	struct pci_bus *bus = slot->bus;
+	struct pci_dev *dev, *tmp;
+	LIST_HEAD(trim_list);
+
+	down_write(&pci_bus_sem);
+
+	list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list)
+		if (PCI_SLOT(dev->devfn) == slot->device)
+			find_stale_devices(dev, &trim_list);
+
+	up_write(&pci_bus_sem);
+
+	while (!list_empty(&trim_list)) {
+		acpi_handle handle;
+
+		dev = list_first_entry(&trim_list, struct pci_dev, bus_list);
+		handle = ACPI_HANDLE(&dev->dev);
+		pci_stop_and_remove_bus_device(dev);
+		pci_dev_put(dev);
+		if (handle)
+			acpiphp_bus_trim(handle);
+	}
+}
+
 /**
  * acpiphp_check_bridge - re-enumerate devices
  * @bridge: where to begin re-enumeration
@@ -737,18 +763,11 @@ static void acpiphp_check_bridge(struct
 	struct acpiphp_slot *slot;
 
 	list_for_each_entry(slot, &bridge->slots, node) {
-		struct pci_bus *bus = slot->bus;
-		struct pci_dev *dev, *tmp;
-
 		mutex_lock(&slot->crit_sect);
 		/* wake up all functions */
 		if (get_slot_status(slot) == ACPI_STA_ALL) {
 			/* remove stale devices if any */
-			list_for_each_entry_safe(dev, tmp, &bus->devices,
-						 bus_list)
-				if (PCI_SLOT(dev->devfn) == slot->device)
-					trim_stale_devices(dev);
-
+			acpiphp_trim_stale_devices(slot);
 			/* configure all functions */
 			enable_slot(slot);
 		} else {


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-29 23:38             ` Rafael J. Wysocki
@ 2013-11-29 23:45               ` Rafael J. Wysocki
  2013-11-30  0:31                 ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-29 23:45 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > > finds that it is OK to do the device_del() and put_device() which causes
> > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > > on the already freed device object and crashes the kernel.
> > >
> > thread2 should still hold one extra reference.
> > that is in
> >   device_schedule_callback
> >      ==> sysfs_schedule_callback
> >          ==> kobject_get(kobj)
> > 
> > pci_destroy_dev for thread2 is called at this point.
> > 
> > and that reference will be released from
> >         sysfs_schedule_callback
> >         ==> kobject_put()...
> 
> Well, that would be the case if thread 2 was started by device_schedule_callback(),
> but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> is racy anyway, because it walks bus->devices without locking.  Which is my
> fault too, because I overlooked that.  Shame, shame.]
> 
> Perhaps we can do something like the (untested) patch below (in addition to the
> $subject patch).  Do you see any immediate problems with it?

Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
So much for being clever.

Moreover, it looks like those two routines above are racy too for the same
reason?

Rafael


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-29 23:45               ` Rafael J. Wysocki
@ 2013-11-30  0:31                 ` Rafael J. Wysocki
  2013-11-30 21:37                   ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-30  0:31 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
> On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >
> > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > > > finds that it is OK to do the device_del() and put_device() which causes
> > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > > > on the already freed device object and crashes the kernel.
> > > >
> > > thread2 should still hold one extra reference.
> > > that is in
> > >   device_schedule_callback
> > >      ==> sysfs_schedule_callback
> > >          ==> kobject_get(kobj)
> > > 
> > > pci_destroy_dev for thread2 is called at this point.
> > > 
> > > and that reference will be released from
> > >         sysfs_schedule_callback
> > >         ==> kobject_put()...
> > 
> > Well, that would be the case if thread 2 was started by device_schedule_callback(),
> > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> > is racy anyway, because it walks bus->devices without locking.  Which is my
> > fault too, because I overlooked that.  Shame, shame.]
> > 
> > Perhaps we can do something like the (untested) patch below (in addition to the
> > $subject patch).  Do you see any immediate problems with it?
> 
> Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
> So much for being clever.
> 
> Moreover, it looks like those two routines above are racy too for the same
> reason?

The (still untested) patch below is what I have come up with for now.  The
is_gone flag is now only accessed under pci_remove_rescan_mutex, so it need
not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
(the "locked" one) should hold a ref to the device being removed to avoid
use-after-free (the callers need to be audited for that).
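
As a rough userspace illustration of that split: a flag that is only touched
with the lock held, a variant that expects the caller to hold the lock, and a
wrapper that takes the lock itself.  The names are hypothetical, and the
mutex is modelled as a plain flag with assertions so that the example stays
single-threaded and deterministic.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model; not struct pci_dev. */
struct fake_dev {
	bool is_gone;
	int destroy_count;	/* how many times teardown really ran */
};

/* Models pci_remove_rescan_mutex as a non-recursive lock. */
static bool remove_rescan_locked;

void lock_remove_rescan(void)
{
	assert(!remove_rescan_locked);	/* a second lock would deadlock */
	remove_rescan_locked = true;
}

void unlock_remove_rescan(void)
{
	assert(remove_rescan_locked);
	remove_rescan_locked = false;
}

/* Variant for callers that already hold the lock. */
void fake_remove_locked(struct fake_dev *dev)
{
	assert(remove_rescan_locked);
	if (dev->is_gone)
		return;
	dev->is_gone = true;	/* plain flag: the lock serializes access */
	dev->destroy_count++;
}

/* Wrapper for callers that do not hold the lock. */
void fake_remove(struct fake_dev *dev)
{
	lock_remove_rescan();
	fake_remove_locked(dev);
	unlock_remove_rescan();
}

/* Hotplug-style caller that removes several devices under one lock. */
void fake_trim(struct fake_dev **devs, int n)
{
	lock_remove_rescan();
	for (int i = 0; i < n; i++)
		fake_remove_locked(devs[i]);
	unlock_remove_rescan();
}
```

A caller such as fake_trim() has to use the variant that does not lock,
because calling the wrapper with the lock already held would trip the
assertion in this model and would deadlock on a real non-recursive mutex.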

Well, I probably still missed something, because it's the middle of the night
and I should be going to sleep instead of staring at the PCI removal code.  Sigh.

Thanks,
Rafael


---
 drivers/pci/hotplug/acpiphp_glue.c |   15 ++++++++++++---
 drivers/pci/pci-sysfs.c            |   17 ++++++++++++-----
 drivers/pci/pci.h                  |    3 +++
 drivers/pci/remove.c               |   15 +++++++++++++--
 include/linux/pci.h                |    1 +
 5 files changed, 41 insertions(+), 10 deletions(-)

Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -553,6 +553,8 @@ static void __ref enable_slot(struct acp
 	int max, pass;
 	LIST_HEAD(add_list);
 
+	lock_pci_remove_rescan();
+
 	acpiphp_rescan_slot(slot);
 	max = acpiphp_max_busnr(bus);
 	for (pass = 0; pass < 2; pass++) {
@@ -586,6 +588,8 @@ static void __ref enable_slot(struct acp
 
 	pci_bus_add_devices(bus);
 
+	unlock_pci_remove_rescan();
+
 	slot->flags |= SLOT_ENABLED;
 	list_for_each_entry(func, &slot->funcs, sibling) {
 		dev = pci_get_slot(bus, PCI_DEVFN(slot->device,
@@ -626,6 +630,7 @@ static void disable_slot(struct acpiphp_
 	struct acpiphp_func *func;
 	struct pci_dev *pdev;
 
+	lock_pci_remove_rescan();
 	/*
 	 * enable_slot() enumerates all functions in this device via
 	 * pci_scan_slot(), whether they have associated ACPI hotplug
@@ -633,9 +638,10 @@ static void disable_slot(struct acpiphp_
 	 * here.
 	 */
 	while ((pdev = dev_in_slot(slot))) {
-		pci_stop_and_remove_bus_device(pdev);
+		__pci_stop_and_remove_bus_device(pdev);
 		pci_dev_put(pdev);
 	}
+	unlock_pci_remove_rescan();
 
 	list_for_each_entry(func, &slot->funcs, sibling)
 		acpiphp_bus_trim(func_to_handle(func));
@@ -710,7 +716,7 @@ static void trim_stale_devices(struct pc
 		alive = pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &v, 0);
 	}
 	if (!alive) {
-		pci_stop_and_remove_bus_device(dev);
+		__pci_stop_and_remove_bus_device(dev);
 		if (handle)
 			acpiphp_bus_trim(handle);
 	} else if (bus) {
@@ -743,12 +749,15 @@ static void acpiphp_check_bridge(struct
 		mutex_lock(&slot->crit_sect);
 		/* wake up all functions */
 		if (get_slot_status(slot) == ACPI_STA_ALL) {
+			lock_pci_remove_rescan();
+
 			/* remove stale devices if any */
 			list_for_each_entry_safe(dev, tmp, &bus->devices,
 						 bus_list)
 				if (PCI_SLOT(dev->devfn) == slot->device)
 					trim_stale_devices(dev);
 
+			unlock_pci_remove_rescan();
 			/* configure all functions */
 			enable_slot(slot);
 		} else {
@@ -783,7 +792,7 @@ static void acpiphp_sanitize_bus(struct
 					res->end) {
 				/* Could not assign a required resources
 				 * for this device, remove it */
-				pci_stop_and_remove_bus_device(dev);
+				__pci_stop_and_remove_bus_device(dev);
 				break;
 			}
 		}
Index: linux-pm/drivers/pci/pci-sysfs.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-sysfs.c
+++ linux-pm/drivers/pci/pci-sysfs.c
@@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
 static DEVICE_ATTR_RW(msi_bus);
 
 static DEFINE_MUTEX(pci_remove_rescan_mutex);
+
+void lock_pci_remove_rescan(void)
+{
+	mutex_lock(&pci_remove_rescan_mutex);
+}
+
+void unlock_pci_remove_rescan(void)
+{
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
@@ -354,11 +365,7 @@ static struct device_attribute dev_resca
 
 static void remove_callback(struct device *dev)
 {
-	struct pci_dev *pdev = to_pci_dev(dev);
-
-	mutex_lock(&pci_remove_rescan_mutex);
-	pci_stop_and_remove_bus_device(pdev);
-	mutex_unlock(&pci_remove_rescan_mutex);
+	pci_stop_and_remove_bus_device(to_pci_dev(dev));
 }
 
 static ssize_t
Index: linux-pm/drivers/pci/pci.h
===================================================================
--- linux-pm.orig/drivers/pci/pci.h
+++ linux-pm/drivers/pci/pci.h
@@ -11,8 +11,11 @@ extern const unsigned char pcie_link_spe
 
 /* Functions internal to the PCI core code */
 
+void lock_pci_remove_rescan(void);
+void unlock_pci_remove_rescan(void);
 int pci_create_sysfs_dev_files(struct pci_dev *pdev);
 void pci_remove_sysfs_dev_files(struct pci_dev *pdev);
+void __pci_stop_and_remove_bus_device(struct pci_dev *pdev);
 #if !defined(CONFIG_DMI) && !defined(CONFIG_ACPI)
 static inline void pci_create_firmware_label_files(struct pci_dev *pdev)
 { return; }
Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -34,6 +34,10 @@ static void pci_stop_dev(struct pci_dev
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
+	if (dev->is_gone)
+		return;
+
+	dev->is_gone = 1;
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
@@ -95,6 +99,12 @@ static void pci_remove_bus_device(struct
 	pci_destroy_dev(dev);
 }
 
+void __pci_stop_and_remove_bus_device(struct pci_dev *dev)
+{
+	pci_stop_bus_device(dev);
+	pci_remove_bus_device(dev);
+}
+
 /**
  * pci_stop_and_remove_bus_device - remove a PCI device and any children
  * @dev: the device to remove
@@ -109,8 +119,9 @@ static void pci_remove_bus_device(struct
  */
 void pci_stop_and_remove_bus_device(struct pci_dev *dev)
 {
-	pci_stop_bus_device(dev);
-	pci_remove_bus_device(dev);
+	lock_pci_remove_rescan();
+	__pci_stop_and_remove_bus_device(dev);
+	unlock_pci_remove_rescan();
 }
 EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_gone:1;
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	block_cfg_access:1;	/* config space access is blocked */


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-30  0:31                 ` Rafael J. Wysocki
@ 2013-11-30 21:37                   ` Rafael J. Wysocki
  2013-11-30 22:27                     ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-11-30 21:37 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Saturday, November 30, 2013 01:31:33 AM Rafael J. Wysocki wrote:
> On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
> > On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> > > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> > > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > >
> > > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > > > > finds that it is OK to do the device_del() and put_device() which causes
> > > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > > > > on the already freed device object and crashes the kernel.
> > > > >
> > > > thread2 should still hold one extra reference.
> > > > that is in
> > > >   device_schedule_callback
> > > >      ==> sysfs_schedule_callback
> > > >          ==> kobject_get(kobj)
> > > > 
> > > > pci_destroy_dev for thread2 is called at this point.
> > > > 
> > > > and that reference will be released from
> > > >         sysfs_schedule_callback
> > > >         ==> kobject_put()...
> > > 
> > > Well, that would be the case if thread 2 was started by device_schedule_callback(),
> > > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> > > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> > > is racy anyway, because it walks bus->devices without locking.  Which is my
> > > fault too, because I overlooked that.  Shame, shame.]
> > > 
> > > Perhaps we can do something like the (untested) patch below (in addition to the
> > > $subject patch).  Do you see any immediate problems with it?
> > 
> > Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
> > So much for being clever.
> > 
> > Moreover, it looks like those two routines above are racy too for the same
> > reason?
> 
> The (still untested) patch below is what I have come up with for now.  The
> is_gone flag is now only operated under pci_remove_rescan_mutex, so it need
> not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
> (the "locked" one) should hold a ref to the device being removed to avoid
> use-after-free (the callers need to be audited for that).
> 
> Well, I probably still missed something, because it's the middle of the night
> and I should be going to sleep instead of staring at the PCI removal code.  Sigh.

Thunderbolt hotplug works for me with this patch applied FWIW.

> ---
>  drivers/pci/hotplug/acpiphp_glue.c |   15 ++++++++++++---
>  drivers/pci/pci-sysfs.c            |   17 ++++++++++++-----
>  drivers/pci/pci.h                  |    3 +++
>  drivers/pci/remove.c               |   15 +++++++++++++--
>  include/linux/pci.h                |    1 +
>  5 files changed, 41 insertions(+), 10 deletions(-)
> 
> Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
> +++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> @@ -553,6 +553,8 @@ static void __ref enable_slot(struct acp
>  	int max, pass;
>  	LIST_HEAD(add_list);
>  
> +	lock_pci_remove_rescan();
> +
>  	acpiphp_rescan_slot(slot);
>  	max = acpiphp_max_busnr(bus);
>  	for (pass = 0; pass < 2; pass++) {
> @@ -586,6 +588,8 @@ static void __ref enable_slot(struct acp
>  
>  	pci_bus_add_devices(bus);
>  
> +	unlock_pci_remove_rescan();
> +
>  	slot->flags |= SLOT_ENABLED;
>  	list_for_each_entry(func, &slot->funcs, sibling) {
>  		dev = pci_get_slot(bus, PCI_DEVFN(slot->device,
> @@ -626,6 +630,7 @@ static void disable_slot(struct acpiphp_
>  	struct acpiphp_func *func;
>  	struct pci_dev *pdev;
>  
> +	lock_pci_remove_rescan();
>  	/*
>  	 * enable_slot() enumerates all functions in this device via
>  	 * pci_scan_slot(), whether they have associated ACPI hotplug
> @@ -633,9 +638,10 @@ static void disable_slot(struct acpiphp_
>  	 * here.
>  	 */
>  	while ((pdev = dev_in_slot(slot))) {
> -		pci_stop_and_remove_bus_device(pdev);
> +		__pci_stop_and_remove_bus_device(pdev);
>  		pci_dev_put(pdev);
>  	}
> +	unlock_pci_remove_rescan();
>  
>  	list_for_each_entry(func, &slot->funcs, sibling)
>  		acpiphp_bus_trim(func_to_handle(func));
> @@ -710,7 +716,7 @@ static void trim_stale_devices(struct pc
>  		alive = pci_bus_read_dev_vendor_id(dev->bus, dev->devfn, &v, 0);
>  	}
>  	if (!alive) {
> -		pci_stop_and_remove_bus_device(dev);
> +		__pci_stop_and_remove_bus_device(dev);
>  		if (handle)
>  			acpiphp_bus_trim(handle);
>  	} else if (bus) {
> @@ -743,12 +749,15 @@ static void acpiphp_check_bridge(struct
>  		mutex_lock(&slot->crit_sect);
>  		/* wake up all functions */
>  		if (get_slot_status(slot) == ACPI_STA_ALL) {
> +			lock_pci_remove_rescan();
> +
>  			/* remove stale devices if any */
>  			list_for_each_entry_safe(dev, tmp, &bus->devices,
>  						 bus_list)
>  				if (PCI_SLOT(dev->devfn) == slot->device)
>  					trim_stale_devices(dev);
>  
> +			unlock_pci_remove_rescan();
>  			/* configure all functions */
>  			enable_slot(slot);
>  		} else {
> @@ -783,7 +792,7 @@ static void acpiphp_sanitize_bus(struct
>  					res->end) {
>  				/* Could not assign a required resources
>  				 * for this device, remove it */
> -				pci_stop_and_remove_bus_device(dev);
> +				__pci_stop_and_remove_bus_device(dev);
>  				break;
>  			}
>  		}
> Index: linux-pm/drivers/pci/pci-sysfs.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-sysfs.c
> +++ linux-pm/drivers/pci/pci-sysfs.c
> @@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
>  static DEVICE_ATTR_RW(msi_bus);
>  
>  static DEFINE_MUTEX(pci_remove_rescan_mutex);
> +
> +void lock_pci_remove_rescan(void)
> +{
> +	mutex_lock(&pci_remove_rescan_mutex);
> +}
> +
> +void unlock_pci_remove_rescan(void)
> +{
> +	mutex_unlock(&pci_remove_rescan_mutex);
> +}
> +
>  static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
>  				size_t count)
>  {
> @@ -354,11 +365,7 @@ static struct device_attribute dev_resca
>  
>  static void remove_callback(struct device *dev)
>  {
> -	struct pci_dev *pdev = to_pci_dev(dev);
> -
> -	mutex_lock(&pci_remove_rescan_mutex);
> -	pci_stop_and_remove_bus_device(pdev);
> -	mutex_unlock(&pci_remove_rescan_mutex);
> +	pci_stop_and_remove_bus_device(to_pci_dev(dev));
>  }
>  
>  static ssize_t
> Index: linux-pm/drivers/pci/pci.h
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci.h
> +++ linux-pm/drivers/pci/pci.h
> @@ -11,8 +11,11 @@ extern const unsigned char pcie_link_spe
>  
>  /* Functions internal to the PCI core code */
>  
> +void lock_pci_remove_rescan(void);
> +void unlock_pci_remove_rescan(void);
>  int pci_create_sysfs_dev_files(struct pci_dev *pdev);
>  void pci_remove_sysfs_dev_files(struct pci_dev *pdev);
> +void __pci_stop_and_remove_bus_device(struct pci_dev *pdev);
>  #if !defined(CONFIG_DMI) && !defined(CONFIG_ACPI)
>  static inline void pci_create_firmware_label_files(struct pci_dev *pdev)
>  { return; }
> Index: linux-pm/drivers/pci/remove.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/remove.c
> +++ linux-pm/drivers/pci/remove.c
> @@ -34,6 +34,10 @@ static void pci_stop_dev(struct pci_dev
>  
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> +	if (dev->is_gone)
> +		return;
> +
> +	dev->is_gone = 1;
>  	device_del(&dev->dev);
>  
>  	down_write(&pci_bus_sem);
> @@ -95,6 +99,12 @@ static void pci_remove_bus_device(struct
>  	pci_destroy_dev(dev);
>  }
>  
> +void __pci_stop_and_remove_bus_device(struct pci_dev *dev)
> +{
> +	pci_stop_bus_device(dev);
> +	pci_remove_bus_device(dev);
> +}
> +
>  /**
>   * pci_stop_and_remove_bus_device - remove a PCI device and any children
>   * @dev: the device to remove
> @@ -109,8 +119,9 @@ static void pci_remove_bus_device(struct
>   */
>  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>  {
> -	pci_stop_bus_device(dev);
> -	pci_remove_bus_device(dev);
> +	lock_pci_remove_rescan();
> +	__pci_stop_and_remove_bus_device(dev);
> +	unlock_pci_remove_rescan();
>  }
>  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>  
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>  	unsigned int	multifunction:1;/* Part of multi-function device */
>  	/* keep track of device state */
>  	unsigned int	is_added:1;
> +	unsigned int	is_gone:1;
>  	unsigned int	is_busmaster:1; /* device is busmaster */
>  	unsigned int	no_msi:1;	/* device may not use msi */
>  	unsigned int	block_cfg_access:1;	/* config space access is blocked */
> 
-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-30 21:37                   ` Rafael J. Wysocki
@ 2013-11-30 22:27                     ` Yinghai Lu
  2013-12-01  1:24                       ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-11-30 22:27 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Sat, Nov 30, 2013 at 1:37 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Saturday, November 30, 2013 01:31:33 AM Rafael J. Wysocki wrote:
>> On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
>> > On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
>> > > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
>> > > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > > > >
>> > > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
>> > > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
>> > > > > finds that it is OK to do the device_del() and put_device() which causes
>> > > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
>> > > > > on the already freed device object and crashes the kernel.
>> > > > >
>> > > > thread2 should still hold one extra reference.
>> > > > that is in
>> > > >   device_schedule_callback
>> > > >      ==> sysfs_schedule_callback
>> > > >          ==> kobject_get(kobj)
>> > > >
>> > > > pci_destroy_dev for thread2 is called at this point.
>> > > >
>> > > > and that reference will be released from
>> > > >         sysfs_schedule_callback
>> > > >         ==> kobject_put()...
>> > >
>> > > Well, that would be the case if thread 2 was started by device_schedule_callback(),
>> > > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
>> > > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
>> > > is racy anyway, because it walks bus->devices without locking.  Which is my
>> > > fault too, because I overlooked that.  Shame, shame.]
>> > >

can you add extra reference to that path?

>> > > Perhaps we can do something like the (untested) patch below (in addition to the
>> > > $subject patch).  Do you see any immediate problems with it?
>> >
>> > Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
>> > So much for being clever.
>> >
>> > Moreover, it looks like those two routines above are racy too for the same
>> > reason?
>>
>> The (still untested) patch below is what I have come up with for now.  The
>> is_gone flag is now only operated under pci_remove_rescan_mutex, so it need
>> not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
>> (the "locked" one) should hold a ref to the device being removed to avoid
>> use-after-free (the callers need to be audited for that).

If you can use device_schedule_..., the caller should already hold a
reference.  Maybe an atomic would be better than lock/unlock everywhere?

>>
>> Well, I probably still missed something, because it's the middle of the night
>> and I should be going to sleep instead of staring at the PCI removal code.  Sigh.


>
> Thunderbolt hotplug works for me with this patch applied FWIW.
>
> --
> I speak only for myself.
> Rafael J. Wysocki, Intel Open Source Technology Center.

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-11-30 22:27                     ` Yinghai Lu
@ 2013-12-01  1:24                       ` Rafael J. Wysocki
  2013-12-02  1:29                         ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-12-01  1:24 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg

On Saturday, November 30, 2013 02:27:15 PM Yinghai Lu wrote:
> On Sat, Nov 30, 2013 at 1:37 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Saturday, November 30, 2013 01:31:33 AM Rafael J. Wysocki wrote:
> >> On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
> >> > On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> >> > > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> >> > > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> > > > >
> >> > > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> >> > > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> >> > > > > finds that it is OK to do the device_del() and put_device() which causes
> >> > > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> >> > > > > on the already freed device object and crashes the kernel.
> >> > > > >
> >> > > > thread2 should still hold one extra reference.
> >> > > > that is in
> >> > > >   device_schedule_callback
> >> > > >      ==> sysfs_schedule_callback
> >> > > >          ==> kobject_get(kobj)
> >> > > >
> >> > > > pci_destroy_dev for thread2 is called at this point.
> >> > > >
> >> > > > and that reference will be released from
> >> > > >         sysfs_schedule_callback
> >> > > >         ==> kobject_put()...
> >> > >
> >> > > Well, that would be the case if thread 2 was started by device_schedule_callback(),
> >> > > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> >> > > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> >> > > is racy anyway, because it walks bus->devices without locking.  Which is my
> >> > > fault too, because I overlooked that.  Shame, shame.]
> >> > >
> 
> can you add extra reference to that path?

hotplug_event_work()
	hotplug_event()
		acpiphp_check_bridge()
			trim_stale_devices()
				pci_stop_and_remove_bus_device()

Yes, it should hold a reference to dev, but adding it there doesn't really help,
because there are list walks over &bus->devices in acpiphp_check_bridge() and
trim_stale_devices() that are racy with respect to pci_stop_and_remove_bus_device()
run from device_schedule_callback().

> >> > > Perhaps we can do something like the (untested) patch below (in addition to the
> >> > > $subject patch).  Do you see any immediate problems with it?
> >> >
> >> > Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
> >> > So much for being clever.
> >> >
> >> > Moreover, it looks like those two routines above are racy too for the same
> >> > reason?
> >>
> >> The (still untested) patch below is what I have come up with for now.  The
> >> is_gone flag is now only operated under pci_remove_rescan_mutex, so it need
> >> not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
> >> (the "locked" one) should hold a ref to the device being removed to avoid
> >> use-after-free (the callers need to be audited for that).
> 
> if you can use device_schedule_...,

No, I can't.  I need to hold acpi_scan_lock taken in hotplug_event_work()
throughout all bus trimming/scanning and I need to protect list walks over
&bus->devices too.

> should have hold reference may be
> atomic would be better than lock/unlock everywhere?

The locking is necessary not only for the device removal itself, but also for
the safety of the &bus->devices list walks.

Besides, remove_callback() in pci-sysfs.c already holds pci_remove_rescan_mutex
around pci_stop_and_remove_bus_device() and I don't see how it would be safe
to run pci_stop_and_remove_bus_device() without holding that mutex from
anywhere else.

For one example, pci_stop_and_remove_bus_device() that is not run under
pci_remove_rescan_mutex can race with the stuff called under that mutex
in dev_bus_rescan_store() (and elsewhere in pci-sysfs.c).

So either pci_remove_rescan_mutex is useless and should be dropped, or
it is there for a purpose, in which case it needs to be used around
pci_stop_and_remove_bus_device() everywhere.  There's no other possibility
and to my eyes that mutex is necessary.
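
A minimal userspace sketch of the locked/unlocked split argued for above,
assuming a pthread mutex as a stand-in for pci_remove_rescan_mutex.  All
model_* names are hypothetical and are not kernel APIs; the "present" flag is
an assumption added so that a second removal attempt becomes a no-op.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-in for pci_remove_rescan_mutex (hypothetical userspace model). */
static pthread_mutex_t remove_rescan_mutex = PTHREAD_MUTEX_INITIALIZER;

struct model_dev {
	int present;	/* 1 while the device is still on the bus */
	int removals;	/* how many times the removal body actually ran */
};

/* Unlocked variant: the caller must already hold remove_rescan_mutex. */
static void __model_stop_and_remove(struct model_dev *dev)
{
	if (!dev->present)
		return;
	dev->present = 0;
	dev->removals++;
}

/* Locked variant for callers that do not hold the mutex yet. */
static void model_stop_and_remove(struct model_dev *dev)
{
	pthread_mutex_lock(&remove_rescan_mutex);
	__model_stop_and_remove(dev);
	pthread_mutex_unlock(&remove_rescan_mutex);
}

/* Models a hotplug path that walks several devices under one lock hold. */
static void model_trim_slot(struct model_dev *devs, int n)
{
	int i;

	pthread_mutex_lock(&remove_rescan_mutex);
	for (i = 0; i < n; i++)
		__model_stop_and_remove(&devs[i]);
	pthread_mutex_unlock(&remove_rescan_mutex);
}

/* Models the sysfs "remove" callback running in its own thread. */
static void *sysfs_remove_thread(void *arg)
{
	model_stop_and_remove(arg);
	return NULL;
}

/*
 * Runs a sysfs-style removal of devs[0] concurrently with a slot trim over
 * both devices and returns the total number of removal-body runs.
 */
int model_race_demo(void)
{
	struct model_dev devs[2] = { { 1, 0 }, { 1, 0 } };
	pthread_t t;

	if (pthread_create(&t, NULL, sysfs_remove_thread, &devs[0]))
		return -1;
	model_trim_slot(devs, 2);
	pthread_join(t, NULL);

	assert(!devs[0].present && !devs[1].present);
	return devs[0].removals + devs[1].removals;
}
```

In this model the list walk and the per-device removal share one lock hold,
so whichever thread wins the mutex, each device's removal body runs once and
model_race_demo() reports two runs in total.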

Thanks,
Rafael


* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-01  1:24                       ` Rafael J. Wysocki
@ 2013-12-02  1:29                         ` Rafael J. Wysocki
  2013-12-02 14:49                           ` Rafael J. Wysocki
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-12-02  1:29 UTC (permalink / raw)
  To: Yinghai Lu, Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Mika Westerberg, Myron Stowe

So below is a new version of the patch (which has been tested on my Thunderbolt
rig without visibly breaking anything) with the description of all the problems
it attempts to address.  If any of the scenarios described in the changelog are
not possible for some reason, please tell me why that is the case.  I couldn't
find such reasons myself.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal

The following are concurrency problems related to the PCI device
removal code in pci-sysfs.c and in ACPIPHP present in the current
mainline kernel:

Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
  the same top device from remove_callback() in pci-sysfs.c and
  from trim_stale_devices() in acpiphp_glue.c.

  In this scenario the second code path is executed without
  pci_remove_rescan_mutex locked, so the &bus->devices list
  walks in either trim_stale_devices() itself or in
  acpiphp_check_bridge() can suffer from list corruption while the
  first code path is executing pci_destroy_dev() for one of the
  devices on those lists.

  Moreover, if any of the device objects in question is freed
  after pci_destroy_dev() executed by the first code path, the
  second code path may suffer a use-after-free problem while
  trying to access that device object.

  Conversely, the second code path may execute pci_destroy_dev()
  for one of the devices in question such that one of the
  &bus->devices list walks in pci_stop_bus_device()
  or pci_remove_bus_device() executed by the first code path will
  suffer from a list corruption.

  Moreover, use-after-free is also possible if one of the device
  objects in question is freed as a result of calling
  pci_destroy_dev() by the second code path and then the first
  code path tries to access it (the first code path only holds
  an extra reference to the device it has been run for, but not
  for its child devices).

Scenario 2: ACPI hotplug event occurs for a device under a bridge
  being removed by pci_stop_and_remove_bus_device() run from
  remove_callback() in pci-sysfs.c.

  In that case it doesn't make sense to handle the hotplug event,
  because the device in question will be removed anyway along with
  its parent bridge and that will cause the context objects needed
  for hotplug handling to be freed as well.

  Moreover, if the event is handled regardless, it may cause one
  or more devices already removed by pci_stop_and_remove_bus_device()
  to be added again by the code handling the event, which will
  conflict with the bridge removal.

Scenario 3: pci_stop_and_remove_bus_device() is run from
  trim_stale_devices() (as a result of an ACPI hotplug event) in
  parallel with dev_bus_rescan_store(), bus_rescan_store(),
  or dev_rescan_store().

  In that scenario the second code path may attempt to operate
  on device objects being removed by the first code path which
  may lead to many interesting types of breakage.

Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
  host bridge removal event) in parallel with bus_rescan_store(),
  dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
  for any devices under the host bridge in question.

  In that case the same symptoms as in Scenarios 1 and 3 may occur
  depending on which code path wins the races involved.

Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
  for a device and its parent bridge via remove_callback().

  In that case both code paths attempt to acquire
  pci_remove_rescan_mutex.  If the child device removal acquires
  it first, there will be no problems.  However, if the parent
  bridge removal acquires it first, it will eventually execute
  pci_destroy_dev() for the child device, but that device will
  not be freed yet due to the reference held by the concurrent
  child removal.  Consequently, both pci_stop_bus_device() and
  pci_remove_bus_device() will be executed for that device
  unnecessarily and pci_destroy_dev() will see a corrupted list
  head in that object.  Moreover, an excess put_device() will
  be executed for that device in that case which may lead to a
  use-after-free in the final kobject_put() done by
  sysfs_schedule_callback_work().

All of these scenarios are addressed by the patch below as follows.

(1) To prevent Scenarios 1 and 3 from happening, hold
    pci_remove_rescan_mutex around hotplug_event() in
    hotplug_event_work() (acpiphp_glue.c).

(2) To prevent Scenario 2 from happening, add an ACPIPHP bridge
    flag is_going_away indicating that hotplug events should be
    ignored for children below that bridge.  That flag is set
    by cleanup_bridge() that for non-root bridges should be run
    under pci_remove_rescan_mutex (for root bridges it is only
    run under acpi_scan_lock anyway).

(3) To prevent Scenario 4 from happening, hold
    pci_remove_rescan_mutex around pci_stop_root_bus() and
    pci_remove_root_bus() in acpi_pci_root_remove().

(4) To prevent Scenario 5 from happening, add a new is_gone
    flag to struct pci_dev that will be set by pci_destroy_dev()
    and checked by pci_stop_and_remove_bus_device().  That only
    covers cases in which pci_stop_and_remove_bus_device() is
    run under pci_remove_rescan_mutex, but the other existing
    cases need to be fixed to use that mutex anyway for other
    reasons (analogous to Scenarios 1 and 3, for example).
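
A rough, single-threaded sketch of Scenario 5 and fix (4), assuming the two
removals are already serialized by the mutex so they can be replayed one
after the other.  The model_* names, the single-child tree, and the plain
integer reference count are illustrative assumptions and not the kernel's
data structures.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev {
	int refcount;			/* models the device reference count */
	unsigned int is_gone:1;		/* set once the destroy step has run */
	int destroy_calls;		/* how many times the destroy body ran */
	struct model_dev *parent;
	struct model_dev *child;	/* one child is enough for this model */
};

/* Models pci_destroy_dev(): mark, unlink from the parent, drop one ref. */
static void model_destroy(struct model_dev *dev)
{
	dev->is_gone = 1;
	dev->destroy_calls++;
	if (dev->parent)
		dev->parent->child = NULL;	/* models the bus list removal */
	dev->refcount--;			/* models put_device() */
}

/* Models the removal body: children first, then the device itself. */
static void model_remove_tree(struct model_dev *dev)
{
	if (dev->child)
		model_remove_tree(dev->child);
	model_destroy(dev);
}

/* Models the top-level removal with the is_gone check from fix (4). */
static void model_stop_and_remove(struct model_dev *dev)
{
	if (!dev->is_gone)
		model_remove_tree(dev);
}

/* Replays Scenario 5 and returns how often the child was destroyed. */
int model_scenario5(void)
{
	struct model_dev parent = { .refcount = 1 };
	struct model_dev child = { .refcount = 1, .parent = &parent };

	parent.child = &child;

	/* The child's scheduled removal callback holds an extra reference. */
	child.refcount++;

	/* The parent bridge removal wins the mutex and destroys the child. */
	model_stop_and_remove(&parent);

	/* The child's delayed callback runs next and finds is_gone set. */
	model_stop_and_remove(&child);
	child.refcount--;		/* the callback drops its reference */

	assert(child.refcount == 0);	/* no excess put happened */
	return child.destroy_calls;
}
```

Without the is_gone check the second call would destroy the child again and
drop its count below zero, which is the excess put_device() described in
Scenario 5; with the check the child is destroyed exactly once.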

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/pci_root.c            |    4 ++++
 drivers/pci/hotplug/acpiphp.h      |    1 +
 drivers/pci/hotplug/acpiphp_glue.c |   22 ++++++++++++++++++++--
 drivers/pci/pci-sysfs.c            |   11 +++++++++++
 drivers/pci/remove.c               |    7 +++++--
 include/linux/pci.h                |    3 +++
 6 files changed, 44 insertions(+), 4 deletions(-)

Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_gone:1;
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	block_cfg_access:1;	/* config space access is blocked */
@@ -1021,6 +1022,8 @@ void set_pcie_hotplug_bridge(struct pci_
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
+void lock_pci_remove_rescan(void);
+void unlock_pci_remove_rescan(void);
 
 /* Vital product data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
Index: linux-pm/drivers/pci/pci-sysfs.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-sysfs.c
+++ linux-pm/drivers/pci/pci-sysfs.c
@@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
 static DEVICE_ATTR_RW(msi_bus);
 
 static DEFINE_MUTEX(pci_remove_rescan_mutex);
+
+void lock_pci_remove_rescan(void)
+{
+	mutex_lock(&pci_remove_rescan_mutex);
+}
+
+void unlock_pci_remove_rescan(void)
+{
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
+	dev->is_gone = 1;
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
@@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
  */
 void pci_stop_and_remove_bus_device(struct pci_dev *dev)
 {
-	pci_stop_bus_device(dev);
-	pci_remove_bus_device(dev);
+	if (!dev->is_gone) {
+		pci_stop_bus_device(dev);
+		pci_remove_bus_device(dev);
+	}
 }
 EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
Index: linux-pm/drivers/pci/hotplug/acpiphp.h
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
+++ linux-pm/drivers/pci/hotplug/acpiphp.h
@@ -71,6 +71,7 @@ struct acpiphp_bridge {
 	struct acpiphp_context *context;
 
 	int nr_slots;
+	bool is_going_away;
 
 	/* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */
 	struct pci_bus *pci_bus;
Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph
 	mutex_lock(&bridge_mutex);
 	list_del(&bridge->list);
 	mutex_unlock(&bridge_mutex);
+
+	/*
+	 * For non-root bridges this flag is protected by the PCI remove/rescan
+	 * locking.  For root bridges it is only operated under acpi_scan_lock
+	 * anyway.
+	 */
+	bridge->is_going_away = true;
 }
 
 /**
@@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc
  *
  * Iterate over all slots under this bridge and make sure that if a
  * card is present they are enabled, and if not they are disabled.
+ *
+ * For non-root bridges call under the PCI remove/rescan mutex.
  */
 static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
 {
 	struct acpiphp_slot *slot;
 
+	/* Bail out if the bridge is going away. */
+	if (bridge->is_going_away)
+		return;
+
 	list_for_each_entry(slot, &bridge->slots, node) {
 		struct pci_bus *bus = slot->bus;
 		struct pci_dev *dev, *tmp;
@@ -878,14 +891,19 @@ static void hotplug_event_work(void *dat
 {
 	struct acpiphp_context *context = data;
 	acpi_handle handle = context->handle;
+	struct acpiphp_bridge *bridge = context->func.parent;
 
 	acpi_scan_lock_acquire();
+	lock_pci_remove_rescan();
 
-	hotplug_event(handle, type, context);
+	/* Bail out if the parent bridge is going away. */
+	if (!bridge->is_going_away)
+		hotplug_event(handle, type, context);
 
+	unlock_pci_remove_rescan();
 	acpi_scan_lock_release();
 	acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
-	put_bridge(context->func.parent);
+	put_bridge(bridge);
 }
 
 /**
Index: linux-pm/drivers/acpi/pci_root.c
===================================================================
--- linux-pm.orig/drivers/acpi/pci_root.c
+++ linux-pm/drivers/acpi/pci_root.c
@@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct
 {
 	struct acpi_pci_root *root = acpi_driver_data(device);
 
+	lock_pci_remove_rescan();
+
 	pci_stop_root_bus(root->bus);
 
 	device_set_run_wake(root->bus->bridge, false);
@@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct
 
 	pci_remove_root_bus(root->bus);
 
+	unlock_pci_remove_rescan();
+
 	kfree(root);
 }
 



* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-02  1:29                         ` Rafael J. Wysocki
@ 2013-12-02 14:49                           ` Rafael J. Wysocki
  2013-12-05 22:40                             ` Bjorn Helgaas
  2013-12-06  6:52                             ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
  0 siblings, 2 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-12-02 14:49 UTC (permalink / raw)
  To: Yinghai Lu, Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	Mika Westerberg, Myron Stowe

On Monday, December 02, 2013 02:29:46 AM Rafael J. Wysocki wrote:
> On Sunday, December 01, 2013 02:24:33 AM Rafael J. Wysocki wrote:
> > On Saturday, November 30, 2013 02:27:15 PM Yinghai Lu wrote:
> > > On Sat, Nov 30, 2013 at 1:37 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > > On Saturday, November 30, 2013 01:31:33 AM Rafael J. Wysocki wrote:
> > > >> On Saturday, November 30, 2013 12:45:55 AM Rafael J. Wysocki wrote:
> > > >> > On Saturday, November 30, 2013 12:38:26 AM Rafael J. Wysocki wrote:
> > > >> > > On Tuesday, November 26, 2013 06:26:54 PM Yinghai Lu wrote:
> > > >> > > > On Tue, Nov 26, 2013 at 5:24 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > > >> > > > >
> > > >> > > > > So assume pci_destroy_dev() is called twice in parallel for the same dev
> > > >> > > > > by two different threads.  Thread 1 does the atomic_inc_and_test() and
> > > >> > > > > finds that it is OK to do the device_del() and put_device() which causes
> > > >> > > > > the device object to be freed.  Then thread 2 does the atomic_inc_and_test()
> > > >> > > > > on the already freed device object and crashes the kernel.
> > > >> > > > >
> > > >> > > > thread2 should still hold one extra reference.
> > > >> > > > that is in
> > > >> > > >   device_schedule_callback
> > > >> > > >      ==> sysfs_schedule_callback
> > > >> > > >          ==> kobject_get(kobj)
> > > >> > > >
> > > >> > > > pci_destroy_dev for thread2 is called at this point.
> > > >> > > >
> > > >> > > > and that reference will be released from
> > > >> > > >         sysfs_schedule_callback
> > > >> > > >         ==> kobject_put()...
> > > >> > >
> > > >> > > Well, that would be the case if thread 2 was started by device_schedule_callback(),
> > > >> > > but again, for example, it may be trim_stale_devices() started by acpiphp_check_bridge()
> > > >> > > that doesn't hold extra references to the pci_dev.  [Well, that piece of code
> > > >> > > is racy anyway, because it walks bus->devices without locking.  Which is my
> > > >> > > fault too, because I overlooked that.  Shame, shame.]
> > > >> > >
> > > 
> > > can you add extra reference to that path?
> > 
> > hotplug_event_work()
> > 	hotplug_event()
> > 		acpiphp_check_bridge()
> > 			trim_stale_devices()
> > 				pci_stop_and_remove_bus_device()
> > 
> > Yes, it should hold a reference to dev, but adding it there doesn't really help,
> > because there are list walks over &bus->devices in acpiphp_check_bridge() and
> > trim_stale_devices() that are racy with respect to pci_stop_and_remove_bus_device()
> > run from device_schedule_callback().
> > 
> > > >> > > Perhaps we can do something like the (untested) patch below (in addition to the
> > > >> > > $subject patch).  Do you see any immediate problems with it?
> > > >> >
> > > >> > Ah, I see one.  It will break pci_stop_bus_device() and pci_remove_bus_device().
> > > >> > So much for being clever.
> > > >> >
> > > >> > Moreover, it looks like those two routines above are racy too for the same
> > > >> > reason?
> > > >>
> > > >> The (still untested) patch below is what I have come up with for now.  The
> > > >> is_gone flag is now only operated under pci_remove_rescan_mutex, so it need
> > > >> not be atomic.  Of course, whoever calls pci_stop_and_remove_bus_device()
> > > >> (the "locked" one) should hold a ref to the device being removed to avoid
> > > >> use-after-free (the callers need to be audited for that).
> > > 
> > > if you can use device_schedule_...,
> > 
> > No, I can't.  I need to hold acpi_scan_lock taken in hotplug_event_work()
> > throughout all bus trimming/scanning and I need to protect list walks over
> > &bus->devices too.
> > 
> > > you should already hold a reference; maybe an
> > > atomic would be better than lock/unlock everywhere?
> > 
> > The locking is necessary not only for the device removal itself, but also for
> > the safety of the &bus->devices list walks.
> > 
> > Besides, remove_callback() in remove.c already holds pci_remove_rescan_mutex
> > around pci_stop_and_remove_bus_device() and I don't see how it would be safe
> > to run pci_stop_and_remove_bus_device() without holding that mutex from
> > anywhere else.
> > 
> > For one example, pci_stop_and_remove_bus_device() that is not run under
> > pci_remove_rescan_mutex can race with the stuff called under that mutex
> > in dev_bus_rescan_store() (and elsewhere in pci-sysfs.c).
> > 
> > So either pci_remove_rescan_mutex is useless and should be dropped, or
> > it is there for a purpose, in which case it needs to be used around
> > pci_stop_and_remove_bus_device() everywhere.  There's no other possibility
> > and to my eyes that mutex is necessary.
> 
> So below is a new version of the patch (which has been tested on my Thunderbolt
> rig without visibly breaking anything) with the description of all the problems
> it attempts to address.  If any of the scenarios described in the changelog are
> not possible for some reason, please tell me why that is the case.  I couldn't
> find such reasons myself.

And I forgot about the ACPIPHP slot "power" attribute that also may trigger
race conditions with device removal.  Updated patch follows.

Thanks,
Rafael


---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal

The following are concurrency problems related to the PCI device
removal code in pci-sysfs.c and in ACPIPHP present in the current
mainline kernel:

Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
  the same top device from remove_callback() in pci-sysfs.c and
  from trim_stale_devices() in acpiphp_glue.c.

  In this scenario the second code path is executed without
  pci_remove_rescan_mutex locked, so the &bus->devices list
  walks in either trim_stale_devices() itself or in
  acpiphp_check_bridge() can suffer from list corruption while the
  first code path is executing pci_destroy_dev() for one of the
  devices on those lists.

  Moreover, if any of the device objects in question is freed
  after pci_destroy_dev() executed by the first code path, the
  second code path may suffer a use-after-free problem while
  trying to access that device object.

  Conversely, the second code path may execute pci_destroy_dev()
  for one of the devices in question such that one of the
  &bus->devices list walks in pci_stop_bus_device()
  or pci_remove_bus_device() executed by the first code path will
  suffer from a list corruption.

  Moreover, use-after-free is also possible if one of the device
  objects in question is freed as a result of calling
  pci_destroy_dev() by the second code path and then the first
  code path tries to access it (the first code path only holds
  an extra reference to the device it has been run for, but not
  for its child devices).

Scenario 2: ACPI hotplug event occurs for a device under a bridge
  being removed by pci_stop_and_remove_bus_device() run from
  remove_callback() in pci-sysfs.c.

  In that case it doesn't make sense to handle the hotplug event,
  because the device in question will be removed anyway along with
  its parent bridge and that will cause the context objects needed
  for hotplug handling to be freed as well.

  Moreover, if the event is handled regardless, it may cause one
  or more devices already removed by pci_stop_and_remove_bus_device()
  to be added again by the code handling the event, which will
  conflict with the bridge removal.

Scenario 3: pci_stop_and_remove_bus_device() is run from
  trim_stale_devices() (as a result of an ACPI hotplug event) in
  parallel with dev_bus_rescan_store(), bus_rescan_store(),
  or dev_rescan_store().

  In that scenario the second code path may attempt to operate
  on device objects being removed by the first code path which
  may lead to many interesting types of breakage.

Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
  host bridge removal event) in parallel with bus_rescan_store(),
  dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
  for any devices under the host bridge in question.

  In that case the same symptoms as in Scenarios 1 and 3 may occur
  depending on which code path wins the races involved.

Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
  for a device and its parent bridge via remove_callback().

  In that case both code paths attempt to acquire
  pci_remove_rescan_mutex.  If the child device removal acquires
  it first, there will be no problems.  However, if the parent
  bridge removal acquires it first, it will eventually execute
  pci_destroy_dev() for the child device, but that device will
  not be freed yet due to the reference held by the concurrent
  child removal.  Consequently, both pci_stop_bus_device() and
  pci_remove_bus_device() will be executed for that device
  unnecessarily and pci_destroy_dev() will see a corrupted list
  head in that object.  Moreover, an excess put_device() will
  be executed for that device in that case which may lead to a
  use-after-free in the final kobject_put() done by
  sysfs_schedule_callback_work().

Scenario 6: ACPIPHP slot enabling/disabling triggered by the
  slot's "power" attribute in parallel with device removal run
  from remove_callback().

  This scenario may lead to race conditions analogous to the
  ones described in Scenario 1.  It also may lead to situations
  in which an already removed device under a bridge scheduled
  for removal will be added which is analogous to Scenario 2.

All of these scenarios are addressed by the patch below as follows.

(1) To prevent the races in Scenarios 1 and 3 from happening, hold
    pci_remove_rescan_mutex around hotplug_event() in
    hotplug_event_work() (acpiphp_glue.c).

(2) To prevent the races in Scenario 2 from happening, add an ACPIPHP
    bridge flag is_going_away indicating that hotplug events should
    be ignored for children below that bridge.  That flag is set
    by cleanup_bridge(), which for non-root bridges should be run
    under pci_remove_rescan_mutex (for root bridges it is only
    run under acpi_scan_lock anyway).

(3) To prevent the races in Scenario 4 from happening, hold
    pci_remove_rescan_mutex around pci_stop_root_bus() and
    pci_remove_root_bus() in acpi_pci_root_remove().

(4) To prevent the races in Scenario 5 from happening, add a new
    is_gone flag to struct pci_dev that will be set by pci_destroy_dev()
    and checked by pci_stop_and_remove_bus_device().  That only
    covers cases in which pci_stop_and_remove_bus_device() is
    run under pci_remove_rescan_mutex, but the other existing
    cases need to be fixed to use that mutex anyway for other
    reasons (analogous to Scenarios 1 and 3 above, for example).

(5) To prevent the races in Scenario 6 from happening, add
    the PCI remove/rescan locking to acpiphp_enable_slot() and
    acpiphp_disable_and_eject_slot() and make these functions
    check the slot's parent bridge status.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/pci_root.c            |    4 ++
 drivers/pci/hotplug/acpiphp.h      |    1 
 drivers/pci/hotplug/acpiphp_glue.c |   74 ++++++++++++++++++++++++++++++-------
 drivers/pci/pci-sysfs.c            |   11 +++++
 drivers/pci/remove.c               |    7 ++-
 include/linux/pci.h                |    3 +
 6 files changed, 84 insertions(+), 16 deletions(-)

Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -321,6 +321,7 @@ struct pci_dev {
 	unsigned int	multifunction:1;/* Part of multi-function device */
 	/* keep track of device state */
 	unsigned int	is_added:1;
+	unsigned int	is_gone:1;
 	unsigned int	is_busmaster:1; /* device is busmaster */
 	unsigned int	no_msi:1;	/* device may not use msi */
 	unsigned int	block_cfg_access:1;	/* config space access is blocked */
@@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
+void lock_pci_remove_rescan(void);
+void unlock_pci_remove_rescan(void);
 
 /* Vital product data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
Index: linux-pm/drivers/pci/pci-sysfs.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-sysfs.c
+++ linux-pm/drivers/pci/pci-sysfs.c
@@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
 static DEVICE_ATTR_RW(msi_bus);
 
 static DEFINE_MUTEX(pci_remove_rescan_mutex);
+
+void lock_pci_remove_rescan(void)
+{
+	mutex_lock(&pci_remove_rescan_mutex);
+}
+
+void unlock_pci_remove_rescan(void)
+{
+	mutex_unlock(&pci_remove_rescan_mutex);
+}
+
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
+	dev->is_gone = 1;
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
@@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
  */
 void pci_stop_and_remove_bus_device(struct pci_dev *dev)
 {
-	pci_stop_bus_device(dev);
-	pci_remove_bus_device(dev);
+	if (!dev->is_gone) {
+		pci_stop_bus_device(dev);
+		pci_remove_bus_device(dev);
+	}
 }
 EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
Index: linux-pm/drivers/pci/hotplug/acpiphp.h
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
+++ linux-pm/drivers/pci/hotplug/acpiphp.h
@@ -71,6 +71,7 @@ struct acpiphp_bridge {
 	struct acpiphp_context *context;
 
 	int nr_slots;
+	bool is_going_away;
 
 	/* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */
 	struct pci_bus *pci_bus;
Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph
 	mutex_lock(&bridge_mutex);
 	list_del(&bridge->list);
 	mutex_unlock(&bridge_mutex);
+
+	/*
+	 * For non-root bridges this flag is protected by the PCI remove/rescan
+	 * locking.  For root bridges it is only operated under acpi_scan_lock
+	 * anyway.
+	 */
+	bridge->is_going_away = true;
 }
 
 /**
@@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc
  *
  * Iterate over all slots under this bridge and make sure that if a
  * card is present they are enabled, and if not they are disabled.
+ *
+ * For non-root bridges call under the PCI remove/rescan mutex.
  */
 static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
 {
 	struct acpiphp_slot *slot;
 
+	/* Bail out if the bridge is going away. */
+	if (bridge->is_going_away)
+		return;
+
 	list_for_each_entry(slot, &bridge->slots, node) {
 		struct pci_bus *bus = slot->bus;
 		struct pci_dev *dev, *tmp;
@@ -807,6 +820,8 @@ void acpiphp_check_host_bridge(acpi_hand
 	}
 }
 
+static int disable_and_eject_slot(struct acpiphp_slot *slot);
+
 static void hotplug_event(acpi_handle handle, u32 type, void *data)
 {
 	struct acpiphp_context *context = data;
@@ -866,7 +881,7 @@ static void hotplug_event(acpi_handle ha
 	case ACPI_NOTIFY_EJECT_REQUEST:
 		/* request device eject */
 		pr_debug("%s: Device eject notify on %s\n", __func__, objname);
-		acpiphp_disable_and_eject_slot(func->slot);
+		disable_and_eject_slot(func->slot);
 		break;
 	}
 
@@ -878,14 +893,19 @@ static void hotplug_event_work(void *dat
 {
 	struct acpiphp_context *context = data;
 	acpi_handle handle = context->handle;
+	struct acpiphp_bridge *bridge = context->func.parent;
 
 	acpi_scan_lock_acquire();
+	lock_pci_remove_rescan();
 
-	hotplug_event(handle, type, context);
+	/* Bail out if the parent bridge is going away. */
+	if (!bridge->is_going_away)
+		hotplug_event(handle, type, context);
 
+	unlock_pci_remove_rescan();
 	acpi_scan_lock_release();
 	acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
-	put_bridge(context->func.parent);
+	put_bridge(bridge);
 }
 
 /**
@@ -1050,20 +1070,27 @@ void acpiphp_remove_slots(struct pci_bus
  */
 int acpiphp_enable_slot(struct acpiphp_slot *slot)
 {
-	mutex_lock(&slot->crit_sect);
-	/* configure all functions */
-	if (!(slot->flags & SLOT_ENABLED))
-		enable_slot(slot);
+	struct acpiphp_func *func;
+	int ret = -ENODEV;
 
-	mutex_unlock(&slot->crit_sect);
-	return 0;
+	lock_pci_remove_rescan();
+
+	func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
+	if (!func->parent->is_going_away) {
+		mutex_lock(&slot->crit_sect);
+		/* configure all functions */
+		if (!(slot->flags & SLOT_ENABLED))
+			enable_slot(slot);
+
+		mutex_unlock(&slot->crit_sect);
+		ret = 0;
+	}
+
+	unlock_pci_remove_rescan();
+	return ret;
 }
 
-/**
- * acpiphp_disable_and_eject_slot - power off and eject slot
- * @slot: ACPI PHP slot
- */
-int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+static int disable_and_eject_slot(struct acpiphp_slot *slot)
 {
 	struct acpiphp_func *func;
 	int retval = 0;
@@ -1087,6 +1114,25 @@ int acpiphp_disable_and_eject_slot(struc
 	return retval;
 }
 
+/**
+ * acpiphp_disable_and_eject_slot - power off and eject slot.
+ * @slot: ACPIPHP slot.
+ */
+int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+{
+	struct acpiphp_func *func;
+	int ret = -ENODEV;
+
+	lock_pci_remove_rescan();
+
+	func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
+	if (!func->parent->is_going_away)
+		ret = disable_and_eject_slot(slot);
+
+	unlock_pci_remove_rescan();
+	return ret;
+}
+
 
 /*
  * slot enabled:  1
Index: linux-pm/drivers/acpi/pci_root.c
===================================================================
--- linux-pm.orig/drivers/acpi/pci_root.c
+++ linux-pm/drivers/acpi/pci_root.c
@@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct
 {
 	struct acpi_pci_root *root = acpi_driver_data(device);
 
+	lock_pci_remove_rescan();
+
 	pci_stop_root_bus(root->bus);
 
 	device_set_run_wake(root->bus->bridge, false);
@@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct
 
 	pci_remove_root_bus(root->bus);
 
+	unlock_pci_remove_rescan();
+
 	kfree(root);
 }
 



* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-02 14:49                           ` Rafael J. Wysocki
@ 2013-12-05 22:40                             ` Bjorn Helgaas
  2013-12-06  1:21                               ` Rafael J. Wysocki
  2013-12-06  6:52                             ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
  1 sibling, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-12-05 22:40 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> ...
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal
>
> The following are concurrency problems related to the PCI device
> removal code in pci-sysfs.c and in ACPIPHP present in the current
> mainline kernel:

You've found a bunch of issues.  I don't think there's anything to
gain by fixing them all in a single patch, and I think it would be
useful to split them out to help us think about them and find other
places that have similar problems.

> Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
>   the same top device from remove_callback() in pci-sysfs.c and
>   from trim_stale_devices() in acpiphp_glue.c.
>
>   In this scenario the second code path is executed without
>   pci_remove_rescan_mutex locked, so the &bus->devices list
>   walks in either trim_stale_devices() itself or in
>   acpiphp_check_bridge() can suffer from list corruption while the
>   first code path is executing pci_destroy_dev() for one of the
>   devices on those lists.

Protecting &bus->devices is a generic problem, isn't it?  There are
about a zillion uses of it.  Many are in the pcibios_fixup_bus() path.
I think we can get rid of most of those by integrating the work into
the pci_scan_device() path instead of doing it as a post-discovery
fixup, but there will be several other cases left.  If using
pci_remove_rescan_mutex to protect &bus->devices is the right generic
answer, we should document that and audit every place that uses the
list.
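
To make the "document and audit" idea concrete, here is a rough userspace
sketch, not kernel code.  Every toy_* name is invented for illustration,
and the "lock" is only a held/not-held flag standing in for
pci_remove_rescan_mutex.  The point is that each helper touching the
device list states its locking rule and refuses to run when the rule is
broken; in the kernel the usual way to express such a rule is
lockdep_assert_held().

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for pci_remove_rescan_mutex: only records whether it is held. */
struct toy_lock {
	bool held;
};

struct toy_dev {
	int devfn;
	struct toy_dev *next;
};

/* Hypothetical bus: ->devices may only be touched with ->lock held. */
struct toy_bus {
	struct toy_lock *lock;
	struct toy_dev *devices;
};

static void toy_lock_acquire(struct toy_lock *l) { l->held = true; }
static void toy_lock_release(struct toy_lock *l) { l->held = false; }

/* Walk the list; returns the device count, or -1 if the caller
 * did not take the lock first. */
static int toy_bus_count_devices(const struct toy_bus *bus)
{
	const struct toy_dev *d;
	int n = 0;

	if (!bus->lock->held)
		return -1;
	for (d = bus->devices; d; d = d->next)
		n++;
	return n;
}

/* Unlink one device; returns 0 on success, -1 if the lock is not
 * held or the device is not on the list. */
static int toy_bus_remove_device(struct toy_bus *bus, struct toy_dev *dev)
{
	struct toy_dev **pp;

	if (!bus->lock->held)
		return -1;
	for (pp = &bus->devices; *pp; pp = &(*pp)->next) {
		if (*pp == dev) {
			*pp = dev->next;
			dev->next = NULL;
			return 0;
		}
	}
	return -1;
}
```

An audit would then amount to checking that every list walk goes through
a helper like these (or carries the equivalent assertion) and that every
caller takes the lock first.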

>   Moreover, if any of the device objects in question is freed
>   after pci_destroy_dev() executed by the first code path, the
>   second code path may suffer a use-after-free problem while
>   trying to access that device object.
>
>   Conversely, the second code path may execute pci_destroy_dev()
>   for one of the devices in question such that one of the
>   &bus->devices list walks in pci_stop_bus_device()
>   or pci_remove_bus_device() executed by the first code path will
>   suffer from a list corruption.
>
>   Moreover, use-after-free is also possible if one of the device
>   objects in question is freed as a result of calling
>   pci_destroy_dev() by the second code path and then the first
>   code path tries to access it (the first code path only holds
>   an extra reference to the device it has been run for, but not
>   for its child devices).

The use-after-free problems *sound* like a reference counting issue.
Yinghai's patch [1] should fix some of this; how much is left after
that?

[1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org

> Scenario 2: ACPI hotplug event occurs for a device under a bridge
>   being removed by pci_stop_and_remove_bus_device() run from
>   remove_callback() in pci-sysfs.c.
>
>   In that case it doesn't make sense to handle the hotplug event,
>   because the device in question will be removed anyway along with
>   its parent bridge and that will cause the context objects needed
>   for hotplug handling to be freed as well.
>
>   Moreover, if the event is handled regardless, it may cause one
>   or more devices already removed by pci_stop_and_remove_bus_device()
>   to be added again by the code handling the event, which will
>   conflict with the bridge removal.

We definitely need to serialize hotplug events from ACPI and sysfs
(and other sources, like other hotplug drivers).  Would that be
enough?  Adding the is_going_away flag is ACPI-specific and seems like
sort of a point workaround.
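
As a way of thinking about that question, here is a small single-threaded
model in plain C.  It is only a sketch under my own assumptions: the
toy_* names are hypothetical, the lock is a flag, and "adding a device"
is just a counter.  It models an event that is serialized after the
bridge removal, which is the case where ordering alone does not say what
the late event should do.

```c
#include <stdbool.h>

/* Hypothetical model of a bridge with hotplug-capable children. */
struct toy_bridge {
	bool is_going_away;	/* set once removal of the bridge has started */
	int nr_children;	/* devices currently present below the bridge */
};

/* Flag standing in for pci_remove_rescan_mutex. */
static bool toy_lock_held;

static void toy_lock(void)   { toy_lock_held = true; }
static void toy_unlock(void) { toy_lock_held = false; }

/* Removal path; the caller is expected to hold the lock. */
static void toy_cleanup_bridge(struct toy_bridge *b)
{
	b->nr_children = 0;
	b->is_going_away = true;
}

/* Event path: takes the lock, then skips the event if the bridge is
 * being removed.  Returns true if the event was handled (modelled here
 * as adding one child device). */
static bool toy_hotplug_event_work(struct toy_bridge *b)
{
	bool handled = false;

	toy_lock();
	if (!b->is_going_away) {
		b->nr_children++;
		handled = true;
	}
	toy_unlock();
	return handled;
}
```

In this model the lock only orders the two paths.  Without the flag
check, an event that runs after toy_cleanup_bridge() would still bump
nr_children, i.e. re-add a device under a bridge that is already gone.
Whether a generic mechanism could replace the ACPI-specific flag is left
open here.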

> Scenario 3: pci_stop_and_remove_bus_device() is run from
>   trim_stale_devices() (as a result of an ACPI hotplug event) in
>   parallel with dev_bus_rescan_store(), bus_rescan_store(),
>   or dev_rescan_store().
>
>   In that scenario the second code path may attempt to operate
>   on device objects being removed by the first code path which
>   may lead to many interesting types of breakage.
>
> Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
>   host bridge removal event) in parallel with bus_rescan_store(),
>   dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
>   for any devices under the host bridge in question.
>
>   In that case the same symptoms as in Scenarios 1 and 3 may occur
>   depending on which code path wins the races involved.

Scenarios 3 and 4 sound like more cases of hotplug operations needing
to be serialized, right?  If we serialized them sufficiently, would
there still be a problem?  Using pci_remove_rescan_mutex would
serialize *all* PCI hotplug operations, which is more than strictly
necessary, but maybe there's no reason to do anything finer-grained.

> Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>   for a device and its parent bridge via remove_callback().
>
>   In that case both code paths attempt to acquire
>   pci_remove_rescan_mutex.  If the child device removal acquires
>   it first, there will be no problems.  However, if the parent
>   bridge removal acquires it first, it will eventually execute
>   pci_destroy_dev() for the child device, but that device will
>   not be freed yet due to the reference held by the concurrent
>   child removal.  Consequently, both pci_stop_bus_device() and
>   pci_remove_bus_device() will be executed for that device
>   unnecessarily and pci_destroy_dev() will see a corrupted list
>   head in that object.  Moreover, an excess put_device() will
>   be executed for that device in that case which may lead to a
>   use-after-free in the final kobject_put() done by
>   sysfs_schedule_callback_work().

The corrupted list head should be fixed by Yinghai's patch [1].

Where is the extra put_device()?  I see the
kobject_get()/kobject_put() pair in sysfs_schedule_callback() and
sysfs_schedule_callback_work().  Oh, I see -- the remove_store() ->
remove_callback() path acquires no references, but it calls
pci_stop_and_remove_bus_device(), which ultimately does the
put_device() in pci_destroy_dev().

So if both the parent and the child removal manage to get to
remove_callback() and the parent acquires pci_remove_rescan_mutex
first, the child removal will do the extra put_device().
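
The reference counting in that sequence can be checked with a small
userspace model.  This is a sketch under my reading of the scenario, not
the kernel code: the toy_* names are made up, the refcount is a plain
int, and a put on an already-zero count is merely counted instead of
crashing.

```c
#include <stdbool.h>

/* Hypothetical model of a device object with a plain reference count. */
struct toy_dev {
	int refcount;
	bool is_gone;
	int excess_puts;	/* puts issued after the count reached zero */
};

static void toy_get(struct toy_dev *d)
{
	d->refcount++;
}

static void toy_put(struct toy_dev *d)
{
	if (d->refcount == 0) {
		d->excess_puts++;	/* would be a use-after-free */
		return;
	}
	d->refcount--;
}

/* Models pci_destroy_dev(): mark the device gone, drop its initial ref. */
static void toy_destroy_dev(struct toy_dev *d)
{
	d->is_gone = true;
	toy_put(d);
}

/* Models pci_stop_and_remove_bus_device(), optionally with the guard. */
static void toy_stop_and_remove(struct toy_dev *d, bool check_is_gone)
{
	if (check_is_gone && d->is_gone)
		return;
	toy_destroy_dev(d);
}

/* Replays Scenario 5 for the child device; returns the excess puts. */
static int toy_scenario5(bool check_is_gone)
{
	struct toy_dev child = { .refcount = 1 };

	toy_get(&child);	/* child's scheduled callback pins it */
	toy_stop_and_remove(&child, check_is_gone); /* parent removal runs first */
	toy_stop_and_remove(&child, check_is_gone); /* child's own callback */
	toy_put(&child);	/* callback's reference is dropped last */
	return child.excess_puts;
}
```

Without the guard the count goes 1, 2, 1, 0 and the final put lands on
an object whose count is already zero; with the guard the second removal
is skipped and the final put is the one that brings the count to zero.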

There are only six callers of device_schedule_callback(), and I think
five of them are susceptible to this same problem: they are sysfs
store methods, and they use device_schedule_callback() with a callback
that does a put_device() on the device:

  drivers/pci/pci-sysfs.c: remove_store()
  drivers/scsi/scsi_sysfs.c: sdev_store_delete()
  arch/s390/pci/pci_sysfs.c: store_recover()
  drivers/s390/block/dcssblk.c: dcssblk_shared_store()
  drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store()

I don't know what the right fix is, but adding "is_gone" to struct
pci_dev only addresses one of the five places, of course.

Bjorn

> Scenario 6: ACPIPHP slot enabling/disabling triggered by the
>   slot's "power" attribute in parallel with device removal run
>   from remove_callback().
>
>   This scenario may lead to race conditions analogous to the
>   ones described in Scenario 1.  It also may lead to situations
>   in which an already removed device under a bridge scheduled
>   for removal will be added which is analogous to Scenario 2.
>
> All of these scenarios are addressed by the patch below as follows.
>
> (1) To prevent the races in Scenarios 1 and 3 from happening hold
>     pci_remove_rescan_mutex around hotplug_event() in
>     hotplug_event_work() (acpiphp_glue.c).
>
> (2) To prevent the races in Scenario 2 from happening, add an ACPIPHP
>     bridge flag is_going_away indicating that hotplug events should
>     be ignored for children below that bridge.  That flag is set
>     by cleanup_bridge() that for non-root bridges should be run
>     under pci_remove_rescan_mutex (for root bridges it is only
>     run under acpi_scan_lock anyway).
>
> (3) To prevent the races in Scenario 4 from happening, hold
>     pci_remove_rescan_mutex around pci_stop_root_bus() and
>     pci_remove_root_bus() in acpi_pci_root_remove().
>
> (4) To prevent the races in Scenario 5 from happening, add a new
>     is_gone flag to struct pci_dev that will be set by pci_destroy_dev()
>     and checked by pci_stop_and_remove_bus_device().  That only
>     covers cases in which pci_stop_and_remove_bus_device() is
>     run under pci_remove_rescan_mutex, but the other existing
>     cases need to be fixed to use that mutex anyway for other
>     reasons (analogous to Scenarios 1 and 3 above, for example).
>
> (5) To prevent the races in Scenario 6 from happening, add
>     the PCI remove/rescan locking to acpiphp_enable_slot() and
>     acpiphp_disable_and_eject_slot() and make these functions
>     check the slot's parent bridge status.
>
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
>  drivers/acpi/pci_root.c            |    4 ++
>  drivers/pci/hotplug/acpiphp.h      |    1
>  drivers/pci/hotplug/acpiphp_glue.c |   74 ++++++++++++++++++++++++++++++-------
>  drivers/pci/pci-sysfs.c            |   11 +++++
>  drivers/pci/remove.c               |    7 ++-
>  include/linux/pci.h                |    3 +
>  6 files changed, 84 insertions(+), 16 deletions(-)
>
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>         unsigned int    multifunction:1;/* Part of multi-function device */
>         /* keep track of device state */
>         unsigned int    is_added:1;
> +       unsigned int    is_gone:1;
>         unsigned int    is_busmaster:1; /* device is busmaster */
>         unsigned int    no_msi:1;       /* device may not use msi */
>         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> @@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_
>  int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
>  unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
>  unsigned int pci_rescan_bus(struct pci_bus *bus);
> +void lock_pci_remove_rescan(void);
> +void unlock_pci_remove_rescan(void);
>
>  /* Vital product data routines */
>  ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
> Index: linux-pm/drivers/pci/pci-sysfs.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/pci-sysfs.c
> +++ linux-pm/drivers/pci/pci-sysfs.c
> @@ -298,6 +298,17 @@ msi_bus_store(struct device *dev, struct
>  static DEVICE_ATTR_RW(msi_bus);
>
>  static DEFINE_MUTEX(pci_remove_rescan_mutex);
> +
> +void lock_pci_remove_rescan(void)
> +{
> +       mutex_lock(&pci_remove_rescan_mutex);
> +}
> +
> +void unlock_pci_remove_rescan(void)
> +{
> +       mutex_unlock(&pci_remove_rescan_mutex);
> +}
> +
>  static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
>                                 size_t count)
>  {
> Index: linux-pm/drivers/pci/remove.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/remove.c
> +++ linux-pm/drivers/pci/remove.c
> @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> +       dev->is_gone = 1;
>         device_del(&dev->dev);
>
>         down_write(&pci_bus_sem);
> @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>   */
>  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>  {
> -       pci_stop_bus_device(dev);
> -       pci_remove_bus_device(dev);
> +       if (!dev->is_gone) {
> +               pci_stop_bus_device(dev);
> +               pci_remove_bus_device(dev);
> +       }
>  }
>  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>
> Index: linux-pm/drivers/pci/hotplug/acpiphp.h
> ===================================================================
> --- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
> +++ linux-pm/drivers/pci/hotplug/acpiphp.h
> @@ -71,6 +71,7 @@ struct acpiphp_bridge {
>         struct acpiphp_context *context;
>
>         int nr_slots;
> +       bool is_going_away;
>
>         /* This bus (host bridge) or Secondary bus (PCI-to-PCI bridge) */
>         struct pci_bus *pci_bus;
> Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
> +++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
> @@ -439,6 +439,13 @@ static void cleanup_bridge(struct acpiph
>         mutex_lock(&bridge_mutex);
>         list_del(&bridge->list);
>         mutex_unlock(&bridge_mutex);
> +
> +       /*
> +        * For non-root bridges this flag is protected by the PCI remove/rescan
> +        * locking.  For root bridges it is only operated under acpi_scan_lock
> +        * anyway.
> +        */
> +       bridge->is_going_away = true;
>  }
>
>  /**
> @@ -733,11 +740,17 @@ static void trim_stale_devices(struct pc
>   *
>   * Iterate over all slots under this bridge and make sure that if a
>   * card is present they are enabled, and if not they are disabled.
> + *
> + * For non-root bridges call under the PCI remove/rescan mutex.
>   */
>  static void acpiphp_check_bridge(struct acpiphp_bridge *bridge)
>  {
>         struct acpiphp_slot *slot;
>
> +       /* Bail out if the bridge is going away. */
> +       if (bridge->is_going_away)
> +               return;
> +
>         list_for_each_entry(slot, &bridge->slots, node) {
>                 struct pci_bus *bus = slot->bus;
>                 struct pci_dev *dev, *tmp;
> @@ -807,6 +820,8 @@ void acpiphp_check_host_bridge(acpi_hand
>         }
>  }
>
> +static int disable_and_eject_slot(struct acpiphp_slot *slot);
> +
>  static void hotplug_event(acpi_handle handle, u32 type, void *data)
>  {
>         struct acpiphp_context *context = data;
> @@ -866,7 +881,7 @@ static void hotplug_event(acpi_handle ha
>         case ACPI_NOTIFY_EJECT_REQUEST:
>                 /* request device eject */
>                 pr_debug("%s: Device eject notify on %s\n", __func__, objname);
> -               acpiphp_disable_and_eject_slot(func->slot);
> +               disable_and_eject_slot(func->slot);
>                 break;
>         }
>
> @@ -878,14 +893,19 @@ static void hotplug_event_work(void *dat
>  {
>         struct acpiphp_context *context = data;
>         acpi_handle handle = context->handle;
> +       struct acpiphp_bridge *bridge = context->func.parent;
>
>         acpi_scan_lock_acquire();
> +       lock_pci_remove_rescan();
>
> -       hotplug_event(handle, type, context);
> +       /* Bail out if the parent bridge is going away. */
> +       if (!bridge->is_going_away)
> +               hotplug_event(handle, type, context);
>
> +       unlock_pci_remove_rescan();
>         acpi_scan_lock_release();
>         acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
> -       put_bridge(context->func.parent);
> +       put_bridge(bridge);
>  }
>
>  /**
> @@ -1050,20 +1070,27 @@ void acpiphp_remove_slots(struct pci_bus
>   */
>  int acpiphp_enable_slot(struct acpiphp_slot *slot)
>  {
> -       mutex_lock(&slot->crit_sect);
> -       /* configure all functions */
> -       if (!(slot->flags & SLOT_ENABLED))
> -               enable_slot(slot);
> +       struct acpiphp_func *func;
> +       int ret = -ENODEV;
>
> -       mutex_unlock(&slot->crit_sect);
> -       return 0;
> +       lock_pci_remove_rescan();
> +
> +       func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
> +       if (!func->parent->is_going_away) {
> +               mutex_lock(&slot->crit_sect);
> +               /* configure all functions */
> +               if (!(slot->flags & SLOT_ENABLED))
> +                       enable_slot(slot);
> +
> +               mutex_unlock(&slot->crit_sect);
> +               ret = 0;
> +       }
> +
> +       unlock_pci_remove_rescan();
> +       return ret;
>  }
>
> -/**
> - * acpiphp_disable_and_eject_slot - power off and eject slot
> - * @slot: ACPI PHP slot
> - */
> -int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> +static int disable_and_eject_slot(struct acpiphp_slot *slot)
>  {
>         struct acpiphp_func *func;
>         int retval = 0;
> @@ -1087,6 +1114,25 @@ int acpiphp_disable_and_eject_slot(struc
>         return retval;
>  }
>
> +/**
> + * acpiphp_disable_and_eject_slot - power off and eject slot.
> + * @slot: ACPIPHP slot.
> + */
> +int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
> +{
> +       struct acpiphp_func *func;
> +       int ret = -ENODEV;
> +
> +       lock_pci_remove_rescan();
> +
> +       func = list_first_entry(&slot->funcs, struct acpiphp_func, sibling);
> +       if (!func->parent->is_going_away)
> +               ret = disable_and_eject_slot(slot);
> +
> +       unlock_pci_remove_rescan();
> +       return ret;
> +}
> +
>
>  /*
>   * slot enabled:  1
> Index: linux-pm/drivers/acpi/pci_root.c
> ===================================================================
> --- linux-pm.orig/drivers/acpi/pci_root.c
> +++ linux-pm/drivers/acpi/pci_root.c
> @@ -616,6 +616,8 @@ static void acpi_pci_root_remove(struct
>  {
>         struct acpi_pci_root *root = acpi_driver_data(device);
>
> +       lock_pci_remove_rescan();
> +
>         pci_stop_root_bus(root->bus);
>
>         device_set_run_wake(root->bus->bridge, false);
> @@ -623,6 +625,8 @@ static void acpi_pci_root_remove(struct
>
>         pci_remove_root_bus(root->bus);
>
> +       unlock_pci_remove_rescan();
> +
>         kfree(root);
>  }
>
>

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-05 22:40                             ` Bjorn Helgaas
@ 2013-12-06  1:21                               ` Rafael J. Wysocki
  2013-12-06  6:29                                 ` Yinghai Lu
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  0 siblings, 2 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-12-06  1:21 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Thursday, December 05, 2013 03:40:39 PM Bjorn Helgaas wrote:
> On Mon, Dec 2, 2013 at 7:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > ...
> > From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> > Subject: PCI / hotplug / ACPI: Fix concurrency problems related to device removal
> >
> > The following are concurrency problems related to the PCI device
> > removal code in pci-sysfs.c and in ACPIPHP present in the current
> > mainline kernel:
> 
> You've found a bunch of issues.  I don't think there's anything to
> gain by fixing them all in a single patch, and I think it would be
> useful to split them out to help us think about them and find other
> places that have similar problems.

The problem is that they are all related.  Whatever we decide to do with
one of them will likely affect how we deal with the others.

> > Scenario 1: pci_stop_and_remove_bus_device() run concurrently for
> >   the same top device from remove_callback() in pci-sysfs.c and
> >   from trim_stale_devices() in acpiphp_glue.c.
> >
> >   In this scenario the second code path is executed without
> >   pci_remove_rescan_mutex locked, so the &bus->devices list
> >   walks in either trim_stale_devices() itself or in
> >   acpiphp_check_bridge() can suffer from list corruption while the
> >   first code path is executing pci_destroy_dev() for one of the
> >   devices on those lists.
> 
> Protecting &bus->devices is a generic problem, isn't it?

Yes, it is.

In fact, there are two sides to it.  One is read-only users (that is, code
walking the list without modifying it), which should do the walk under
read-locked pci_bus_sem, and that's quite obvious.  It looks like the majority
of users of that list are readers and they fall under this case.  That case is
addressed by write-locking pci_bus_sem around the dev->bus_list deletion in
pci_destroy_dev() (it will wait for all readers to complete their walks as long
as they use pci_bus_sem correctly, but we may as well replace that with SRCU).

The second one is where someone is walking the list and *deleting* entries from
it in the process, which has to be synchronized with other writers and the
write-locking of pci_bus_sem in pci_destroy_dev() is not sufficient for that,
because pci_bus_sem can't be read-locked around a bus->devices walk deleting
entries from that list (exactly because pci_bus_sem is write-locked during
device deletion).  So this is a special case and to fix races between different
code paths that walk bus->devices *and* delete entries from there we need a
separate lock to be held around the entire list walk.  Incidentally,
pci_remove_rescan_mutex is already used in that context by the code in
pci-sysfs.c, so it can be used for this purpose by all of the other code
paths doing this thing, like for example acpiphp_check_bridge() and
trim_stale_devices() (there's more).

> There are about a zillion uses of it.  Many are in the
> pcibios_fixup_bus() path.  I think we can get rid of most of those by
> integrating the work into the pci_scan_device() path instead of doing
> it as a post-discovery fixup, but there will be several other cases
> left.  If using
> pci_remove_rescan_mutex to protect &bus->devices is the right generic
> answer, we should document that and audit every place that uses the
> list.

No, we don't have to do that as explained above.  We only need it around
code that walks the list in order to (possibly) delete entries from it
and there are not too many of such places, so they can be audited quite easily.
[Basically, wherever pci_stop_and_remove_bus_device() is called during a list
walk over bus->devices.]

And *if* we use pci_remove_rescan_mutex for that, it will *also* address some
other synchronization problems.

> >   Moreover, if any of the device objects in question is freed
> >   after pci_destroy_dev() executed by the first code path, the
> >   second code path may suffer a use-after-free problem while
> >   trying to access that device object.
> >
> >   Conversely, the second code path may execute pci_destroy_dev()
> >   for one of the devices in question such that one of the
> >   &bus->devices list walks in pci_stop_bus_device()
> >   or pci_remove_bus_device() executed by the first code path will
> >   suffer from a list corruption.
> >
> >   Moreover, use-after-free is also possible if one of the device
> >   objects in question is freed as a result of calling
> >   pci_destroy_dev() by the second code path and then the first
> >   code path tries to access it (the first code path only holds
> >   an extra reference to the device it has been run for, but not
> >   for its child devices).
> 
> The use-after-free problems *sound* like a reference counting issue.
> Yinghai's patch [1] should fix some of this; how much is left after
> that?
> 
> [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org

Hmm, no.  Do you mean https://patchwork.kernel.org/patch/3261171/ ?

This patch doesn't fix the problem either, however, because it only fixes
the one case described in Scenario 5 below.

Suppose that (1) trim_stale_devices() calls pci_stop_and_remove_bus_device(X)
and (2) remove_callback() calls pci_stop_and_remove_bus_device(X) (i.e. for
the same device) at the same time.

Suppose that X is a leaf device for simplicity and say that code path (2)
is executed first.  We run pci_stop_dev(X) and pci_destroy_dev(X), which
does put_device(&X->dev) and then sysfs_schedule_callback_work() executes
kobject_put() for X's kobject.  The device object pointed to by X is gone at
this point.

Now, code path (1) runs and crashes with a great bang trying to access
X->subordinate in pci_stop_bus_device().  This happens regardless of whether or
not Yinghai's patch has been applied.  QED

Of course, you can argue that this is because code path (1) should have done
a pci_dev_get(X) before running pci_stop_and_remove_bus_device(X), which
would have avoided the crash.  I can agree with that, but *if* code path (1)
had held pci_remove_rescan_mutex around the whole trim_stale_devices(), then
it wouldn't have had to do that pci_dev_get(X), because the entire race
above wouldn't have been possible (code path (2) already holds that mutex
around pci_stop_and_remove_bus_device(X)).

OK, you can ask, but why the heck does code path (1) need to hold
pci_remove_rescan_mutex around trim_stale_devices()?  Well, because it needs
to synchronize the list walks in acpiphp_check_bridge() and trim_stale_devices()
with the list_del() in pci_destroy_dev() executed by code path (2).

That's why I'm proposing to acquire pci_remove_rescan_mutex in
hotplug_event_work() and hold it around the whole hotplug_event().

Of course, analogous races are possible for acpiphp_enable_slot() and
acpiphp_disable_and_eject_slot() and they may be addressed analogously -
by holding pci_remove_rescan_mutex around possible modifications of the PCI
device hierarchy.

> > Scenario 2: ACPI hotplug event occurs for a device under a bridge
> >   being removed by pci_stop_and_remove_bus_device() run from
> >   remove_callback() in pci-sysfs.c.
> >
> >   In that case it doesn't make sense to handle the hotplug event,
> >   because the device in question will be removed anyway along with
> >   its parent bridge and that will cause the context objects needed
> >   for hotplug handling to be freed as well.
> >
> >   Moreover, if the event is handled regardless, it may cause one
> >   or more devices already removed by pci_stop_and_remove_bus_device()
> >   to be added again by the code handling the event, which will
> >   conflict with the bridge removal.
> 
> We definitely need to serialize hotplug events from ACPI and sysfs
> (and other sources, like other hotplug drivers).  Would that be
> enough?  Adding the is_going_away flag is ACPI-specific and seems like
> sort of a point workaround.

Well, we basically need to prevent hotplug_event() from being run in that
case, and since that function is ACPI-specific and the data structures
hotplug_event_work() operates on are ACPI-specific, I'm using an ACPI-specific
flag to achieve that goal.  If you can suggest any approach that would not
be ACPI-specific to address this particular case, I'll be happy to use it. :-)

> > Scenario 3: pci_stop_and_remove_bus_device() is run from
> >   trim_stale_devices() (as a result of an ACPI hotplug event) in
> >   parallel with dev_bus_rescan_store() or bus_rescan_store(),
> >   or dev_rescan_store().
> >
> >   In that scenario the second code path may attempt to operate
> >   on device objects being removed by the first code path which
> >   may lead to many interesting types of breakage.
> >
> > Scenario 4: acpi_pci_root_remove() run (as a result of an ACPI PCI
> >   host bridge removal event) in  parallel with bus_rescan_store(),
> >   dev_bus_rescan_store(), dev_rescan_store(), or remove_callback()
> >   for any devices under the host bridge in question.
> >
> >   In that case the same symptoms as in Scenarios 1 and 3 may occur
> >   depending on which code path wins the races involved.
> 
> Scenarios 3 and 4 sound like more cases of hotplug operations needing
> to be serialized, right?  If we serialized them sufficiently, would
> there still be a problem?

That depends on what exactly you mean by "sufficiently".  That is, what's
the goal?  Different approaches may be sufficient to achieve specific goals,
but not necessarily more than one of them.

> Using pci_remove_rescan_mutex would
> serialize *all* PCI hotplug operations, which is more than strictly
> necessary, but maybe there's no reason to do anything finer-grained.

My answer to this is yes, using pci_remove_rescan_mutex *will* serialize all
PCI hotplug operations, which will be a very good first step.  Having done that,
we can see how much we can relax things and how far we *want* to go with that.

> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >   for a device and its parent bridge via remove_callback().
> >
> >   In that case both code paths attempt to acquire
> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >   it first, there will be no problems.  However, if the parent
> >   bridge removal acquires it first, it will eventually execute
> >   pci_destroy_dev() for the child device, but that device will
> >   not be freed yet due to the reference held by the concurrent
> >   child removal.  Consequently, both pci_stop_bus_device() and
> >   pci_remove_bus_device() will be executed for that device
> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >   head in that object.  Moreover, an excess put_device() will
> >   be executed for that device in that case which may lead to a
> >   use-after-free in the final kobject_put() done by
> >   sysfs_schedule_callback_work().
> 
> The corrupted list head should be fixed by Yinghai's patch [1].

I think you really mean https://patchwork.kernel.org/patch/3261171/ :-)

Generally, patches in that series would mitigate that problem somewhat.

I actually stole Yinghai's idea with the new PCI device flag (sorry,
Yinghai), but I think it's better to check that flag to start with in
pci_stop_and_remove_bus_device(), because then we avoid calling
pci_stop_bus_device() unnecessarily as well (surely, the device was stopped
before it was gone, wasn't it?).
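
A small single-threaded model of the Scenario 5 reference accounting may help
here.  It is a sketch under stated assumptions, not the kernel code: all
model_* names are hypothetical, the child starts with two references (one
from registration, dropped by the put_device() in pci_destroy_dev(), and one
held by the pending sysfs callback), and the parent bridge removal is assumed
to win the race for pci_remove_rescan_mutex.

```c
#include <assert.h>
#include <stdbool.h>

struct model_child {
	int refcount;
	bool is_gone;		/* models the proposed is_gone flag */
	int destroy_calls;
};

static void model_put(struct model_child *c)
{
	assert(c->refcount > 0);	/* an excess put would trip here */
	c->refcount--;
}

/* Models pci_destroy_dev(): mark the device gone, drop one reference. */
static void model_destroy(struct model_child *c)
{
	c->is_gone = true;
	c->destroy_calls++;
	model_put(c);
}

/* Models pci_stop_and_remove_bus_device() with the flag check up front. */
static void model_stop_and_remove(struct model_child *c)
{
	if (c->is_gone)
		return;
	model_destroy(c);
}

/* Parent removal runs first, then the child's own removal callback. */
static void model_scenario5(struct model_child *c)
{
	c->refcount = 2;
	c->is_gone = false;
	c->destroy_calls = 0;

	model_destroy(c);		/* done on behalf of the parent bridge */
	model_stop_and_remove(c);	/* child's callback: now a no-op */
	model_put(c);			/* callback drops its own reference */
}
```

Without the is_gone check, the second removal would call model_destroy()
again and the final model_put() would hit the assertion, which corresponds
to the excess put_device() described in Scenario 5.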

> Where is the extra put_device()?  I see the
> kobject_get()/kobject_put() pair in sysfs_schedule_callback() and
> sysfs_schedule_callback_work().  Oh, I see -- the remove_store() ->
> remove_callback() path acquires no references, but it calls
> pci_stop_and_remove_bus_device(), which ultimately does the
> put_device() in pci_destroy_dev().
> 
> So if both the parent and the child removal manage to get to
> remove_callback() and the parent acquires pci_remove_rescan_mutex
> first, the child removal will do the extra put_device().
> 
> There are only six callers of device_schedule_callback(), and I think
> five of them are susceptible to this same problem: they are sysfs
> store methods, and they use device_schedule_callback() with a callback
> that does a put_device() on the device:
> 
>   drivers/pci/pci-sysfs.c: remove_store()
>   drivers/scsi/scsi_sysfs.c: sdev_store_delete()
>   arch/s390/pci/pci_sysfs.c: store_recover()
>   drivers/s390/block/dcssblk.c: dcssblk_shared_store()
>   drivers/s390/cio/ccwgroup.c: ccwgroup_ungroup_store()
> 
> I don't know what the right fix is, but adding "is_gone" to struct
> pci_dev only addresses one of the five places, of course.

That's correct.

I'm not sure if there *is* a generic solution to this problem, because
sysfs_schedule_callback_work() doesn't know what the callback function is
going to do.  In principle it can look at ss->kobj->parent and skip the
execution of ss->func if that is NULL (which means that the kobject has
been deleted and it is likely the last thing holding a reference to that
kobject), but that still will be racy without extra synchronization in the
users of device_schedule_callback().

Which probably means that device_schedule_callback() is ill-conceived in the
first place, because it can't really do the work it is designed for in general.
So perhaps it's better to use something simpler and PCI-specific for PCI and
analogously in the other cases you mentioned.

OK

To be a bit more constructive, as the next step I'd try to use
pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
above) without making the other changes made by my patch.  Does that sound
reasonable?

Rafael


* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-06  1:21                               ` Rafael J. Wysocki
@ 2013-12-06  6:29                                 ` Yinghai Lu
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  1 sibling, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-12-06  6:29 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Thu, Dec 5, 2013 at 5:21 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>>
>> The use-after-free problems *sound* like a reference counting issue.
>> Yinghai's patch [1] should fix some of this; how much is left after
>> that?
>>
>> [1] http://lkml.kernel.org/r/1385851238-21085-4-git-send-email-yinghai@kernel.org
>
> Hmm, no.  Do you mean https://patchwork.kernel.org/patch/3261171/ ?

should be

https://patchwork.kernel.org/patch/3261001/
[v3,03/12] PCI: Move resources and bus_list releasing to pci_release_dev

which moves list_del(&pci_dev->bus_list) down into pci_release_dev().

Yinghai

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-02 14:49                           ` Rafael J. Wysocki
  2013-12-05 22:40                             ` Bjorn Helgaas
@ 2013-12-06  6:52                             ` Yinghai Lu
  2013-12-07  1:27                               ` Rafael J. Wysocki
  1 sibling, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-12-06  6:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>
> Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>   for a device and its parent bridge via remove_callback().
>
>   In that case both code paths attempt to acquire
>   pci_remove_rescan_mutex.  If the child device removal acquires
>   it first, there will be no problems.  However, if the parent
>   bridge removal acquires it first, it will eventually execute
>   pci_destroy_dev() for the child device, but that device will
>   not be freed yet due to the reference held by the concurrent
>   child removal.  Consequently, both pci_stop_bus_device() and
>   pci_remove_bus_device() will be executed for that device
>   unnecessarily and pci_destroy_dev() will see a corrupted list
>   head in that object.  Moreover, an excess put_device() will
>   be executed for that device in that case which may lead to a
>   use-after-free in the final kobject_put() done by
>   sysfs_schedule_callback_work().
>
> Index: linux-pm/include/linux/pci.h
> ===================================================================
> --- linux-pm.orig/include/linux/pci.h
> +++ linux-pm/include/linux/pci.h
> @@ -321,6 +321,7 @@ struct pci_dev {
>         unsigned int    multifunction:1;/* Part of multi-function device */
>         /* keep track of device state */
>         unsigned int    is_added:1;
> +       unsigned int    is_gone:1;
>         unsigned int    is_busmaster:1; /* device is busmaster */
>         unsigned int    no_msi:1;       /* device may not use msi */
>         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> Index: linux-pm/drivers/pci/remove.c
> ===================================================================
> --- linux-pm.orig/drivers/pci/remove.c
> +++ linux-pm/drivers/pci/remove.c
> @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>
>  static void pci_destroy_dev(struct pci_dev *dev)
>  {
> +       dev->is_gone = 1;
>         device_del(&dev->dev);
>
>         down_write(&pci_bus_sem);
> @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>   */
>  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>  {
> -       pci_stop_bus_device(dev);
> -       pci_remove_bus_device(dev);
> +       if (!dev->is_gone) {
> +               pci_stop_bus_device(dev);
> +               pci_remove_bus_device(dev);
> +       }
>  }
>  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>

Yes, the above change should address the sysfs double-remove problem.

Thanks

Yinghai

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-06  6:52                             ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
@ 2013-12-07  1:27                               ` Rafael J. Wysocki
  2013-12-08  3:31                                 ` Yinghai Lu
  2014-01-13  1:03                                 ` [PATCH] PCI / remove: Check parent kobject in pci_destroy_dev() (was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  0 siblings, 2 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2013-12-07  1:27 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >
> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >   for a device and its parent bridge via remove_callback().
> >
> >   In that case both code paths attempt to acquire
> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >   it first, there will be no problems.  However, if the parent
> >   bridge removal acquires it first, it will eventually execute
> >   pci_destroy_dev() for the child device, but that device will
> >   not be freed yet due to the reference held by the concurrent
> >   child removal.  Consequently, both pci_stop_bus_device() and
> >   pci_remove_bus_device() will be executed for that device
> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >   head in that object.  Moreover, an excess put_device() will
> >   be executed for that device in that case which may lead to a
> >   use-after-free in the final kobject_put() done by
> >   sysfs_schedule_callback_work().
> >
> > Index: linux-pm/include/linux/pci.h
> > ===================================================================
> > --- linux-pm.orig/include/linux/pci.h
> > +++ linux-pm/include/linux/pci.h
> > @@ -321,6 +321,7 @@ struct pci_dev {
> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >         /* keep track of device state */
> >         unsigned int    is_added:1;
> > +       unsigned int    is_gone:1;
> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >         unsigned int    no_msi:1;       /* device may not use msi */
> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> > Index: linux-pm/drivers/pci/remove.c
> > ===================================================================
> > --- linux-pm.orig/drivers/pci/remove.c
> > +++ linux-pm/drivers/pci/remove.c
> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >
> >  static void pci_destroy_dev(struct pci_dev *dev)
> >  {
> > +       dev->is_gone = 1;
> >         device_del(&dev->dev);
> >
> >         down_write(&pci_bus_sem);
> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >   */
> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >  {
> > -       pci_stop_bus_device(dev);
> > -       pci_remove_bus_device(dev);
> > +       if (!dev->is_gone) {
> > +               pci_stop_bus_device(dev);
> > +               pci_remove_bus_device(dev);
> > +       }
> >  }
> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >
> 
> Yes, above change should address sys double remove problem.

I've just realized that we don't need a new flag for that, though.

It looks like we only need to check dev->dev.kobj.parent and return if that is
NULL, because that means pci_destroy_dev() has run for that device already
(I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
it should do that?).

Of course, that still is going to be racy if we don't hold
pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
path using it (or use another similar synchronization mechanism).
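
For illustration, here is a minimal userspace sketch of that check. It is a
model, not the actual kernel change: struct model_dev, model_destroy_dev(),
model_stop_and_remove() and the destroy_calls counter are all hypothetical
stand-ins, and a pthread mutex plays the role of pci_remove_rescan_mutex.
The model clears the parent pointer itself, which stands in for the
assumption above that kobj.parent is NULL once pci_destroy_dev() has run.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for struct pci_dev; only what the check needs. */
struct model_dev {
	struct model_dev *kobj_parent;	/* models dev->dev.kobj.parent */
	int destroy_calls;		/* how many times teardown actually ran */
};

/* Plays the role of pci_remove_rescan_mutex in this model. */
static pthread_mutex_t model_remove_rescan_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Models pci_destroy_dev(): teardown clears the parent pointer. */
void model_destroy_dev(struct model_dev *dev)
{
	dev->destroy_calls++;
	dev->kobj_parent = NULL;
}

/*
 * Models pci_stop_and_remove_bus_device() with the proposed check.
 * Returns 1 if this call removed the device, 0 if it was already gone.
 * The check and the teardown happen under the same lock, so two
 * callers cannot both see a non-NULL parent.
 */
int model_stop_and_remove(struct model_dev *dev)
{
	int removed = 0;

	pthread_mutex_lock(&model_remove_rescan_mutex);
	if (dev->kobj_parent != NULL) {
		model_destroy_dev(dev);
		removed = 1;
	}
	pthread_mutex_unlock(&model_remove_rescan_mutex);

	return removed;
}
```

In this model a second call for the same device returns 0 and leaves
destroy_calls at 1. Without the lock, the check itself would be racy, which
is the point of the paragraph above.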

Thanks,
Rafael


* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-07  1:27                               ` Rafael J. Wysocki
@ 2013-12-08  3:31                                 ` Yinghai Lu
  2013-12-08  3:50                                   ` Greg Kroah-Hartman
  2014-01-13  1:03                                 ` [PATCH] PCI / remove: Check parent kobject in pci_destroy_dev() (was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  1 sibling, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-12-08  3:31 UTC (permalink / raw)
  To: Rafael J. Wysocki, Greg Kroah-Hartman
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

[+ GregKH]

On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >
>> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >   for a device and its parent bridge via remove_callback().
>> >
>> >   In that case both code paths attempt to acquire
>> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >   it first, there will be no problems.  However, if the parent
>> >   bridge removal acquires it first, it will eventually execute
>> >   pci_destroy_dev() for the child device, but that device will
>> >   not be freed yet due to the reference held by the concurrent
>> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >   pci_remove_bus_device() will be executed for that device
>> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >   head in that object.  Moreover, an excess put_device() will
>> >   be executed for that device in that case which may lead to a
>> >   use-after-free in the final kobject_put() done by
>> >   sysfs_schedule_callback_work().
>> >
>> > Index: linux-pm/include/linux/pci.h
>> > ===================================================================
>> > --- linux-pm.orig/include/linux/pci.h
>> > +++ linux-pm/include/linux/pci.h
>> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >         /* keep track of device state */
>> >         unsigned int    is_added:1;
>> > +       unsigned int    is_gone:1;
>> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> > Index: linux-pm/drivers/pci/remove.c
>> > ===================================================================
>> > --- linux-pm.orig/drivers/pci/remove.c
>> > +++ linux-pm/drivers/pci/remove.c
>> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >
>> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >  {
>> > +       dev->is_gone = 1;
>> >         device_del(&dev->dev);
>> >
>> >         down_write(&pci_bus_sem);
>> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >   */
>> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >  {
>> > -       pci_stop_bus_device(dev);
>> > -       pci_remove_bus_device(dev);
>> > +       if (!dev->is_gone) {
>> > +               pci_stop_bus_device(dev);
>> > +               pci_remove_bus_device(dev);
>> > +       }
>> >  }
>> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >
>>
>> Yes, above change should address sys double remove problem.
>
> I've just realized that we don't need a new flag for that, though.
>
> It looks like we only need to check dev->dev.kobj.parent and return if that is
> NULL, because that means pci_destroy_dev() has run for that device already
> (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> it should do that?).
>
> Of course, that still is going to be racy if we don't hold
> pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> path using it (or use another similar synchronization mechanism).

I wonder if we can have a safe way to check whether device_del() has
already been called.

And shouldn't those use-after-free accesses be addressed by the driver core
instead of the PCI code?

Thanks

Yinghai

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-08  3:31                                 ` Yinghai Lu
@ 2013-12-08  3:50                                   ` Greg Kroah-Hartman
  2013-12-09 15:24                                     ` Ethan Zhao
  0 siblings, 1 reply; 69+ messages in thread
From: Greg Kroah-Hartman @ 2013-12-08  3:50 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng,
	Guo Chao, linux-pci, linux-kernel, Mika Westerberg, Myron Stowe

On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
> [+ GregKH]
> 
> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> >
> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >> >   for a device and its parent bridge via remove_callback().
> >> >
> >> >   In that case both code paths attempt to acquire
> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >> >   it first, there will be no problems.  However, if the parent
> >> >   bridge removal acquires it first, it will eventually execute
> >> >   pci_destroy_dev() for the child device, but that device will
> >> >   not be freed yet due to the reference held by the concurrent
> >> >   child removal.  Consequently, both pci_stop_bus_device() and
> >> >   pci_remove_bus_device() will be executed for that device
> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >> >   head in that object.  Moreover, an excess put_device() will
> >> >   be executed for that device in that case which may lead to a
> >> >   use-after-free in the final kobject_put() done by
> >> >   sysfs_schedule_callback_work().
> >> >
> >> > Index: linux-pm/include/linux/pci.h
> >> > ===================================================================
> >> > --- linux-pm.orig/include/linux/pci.h
> >> > +++ linux-pm/include/linux/pci.h
> >> > @@ -321,6 +321,7 @@ struct pci_dev {
> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >> >         /* keep track of device state */
> >> >         unsigned int    is_added:1;
> >> > +       unsigned int    is_gone:1;
> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >> >         unsigned int    no_msi:1;       /* device may not use msi */
> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> >> > Index: linux-pm/drivers/pci/remove.c
> >> > ===================================================================
> >> > --- linux-pm.orig/drivers/pci/remove.c
> >> > +++ linux-pm/drivers/pci/remove.c
> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >> >
> >> >  static void pci_destroy_dev(struct pci_dev *dev)
> >> >  {
> >> > +       dev->is_gone = 1;
> >> >         device_del(&dev->dev);
> >> >
> >> >         down_write(&pci_bus_sem);
> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >> >   */
> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >> >  {
> >> > -       pci_stop_bus_device(dev);
> >> > -       pci_remove_bus_device(dev);
> >> > +       if (!dev->is_gone) {
> >> > +               pci_stop_bus_device(dev);
> >> > +               pci_remove_bus_device(dev);
> >> > +       }
> >> >  }
> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >> >
> >>
> >> Yes, above change should address sys double remove problem.
> >
> > I've just realized that we don't need a new flag for that, though.
> >
> > It looks like we only need to check dev->dev.kobj.parent and return if that is
> > NULL, because that means pci_destroy_dev() has run for that device already
> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> > it should do that?).
> >
> > Of course, that still is going to be racy if we don't hold
> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> > path using it (or use another similar synchronization mechanism).
> 
> Wonder if we can have safe way to check if device_del() is called already.

Nope.

> And those access_after_free should be addressed by driver core instead
> of pci code?

Nope, it's up to the bus to handle this.  It shouldn't be hard, you
shouldn't actually care about this, if you do, something is wrong.

How is this PCI code so hard to get right?  Look at USB for devices that
disappear from anywhere at anytime as an example for how to handle
this.  PCI should be doing the same thing, no need for this "is_gone"
stuff.

greg k-h

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-08  3:50                                   ` Greg Kroah-Hartman
@ 2013-12-09 15:24                                     ` Ethan Zhao
  2013-12-09 19:08                                       ` Greg Kroah-Hartman
  0 siblings, 1 reply; 69+ messages in thread
From: Ethan Zhao @ 2013-12-09 15:24 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Yinghai Lu, Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki,
	Gu Zheng, Guo Chao, linux-pci, linux-kernel, Mika Westerberg,
	Myron Stowe

On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
>> [+ GregKH]
>>
>> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> >
>> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >> >   for a device and its parent bridge via remove_callback().
>> >> >
>> >> >   In that case both code paths attempt to acquire
>> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >> >   it first, there will be no problems.  However, if the parent
>> >> >   bridge removal acquires it first, it will eventually execute
>> >> >   pci_destroy_dev() for the child device, but that device will
>> >> >   not be freed yet due to the reference held by the concurrent
>> >> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >> >   pci_remove_bus_device() will be executed for that device
>> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >> >   head in that object.  Moreover, an excess put_device() will
>> >> >   be executed for that device in that case which may lead to a
>> >> >   use-after-free in the final kobject_put() done by
>> >> >   sysfs_schedule_callback_work().
>> >> >
>> >> > Index: linux-pm/include/linux/pci.h
>> >> > ===================================================================
>> >> > --- linux-pm.orig/include/linux/pci.h
>> >> > +++ linux-pm/include/linux/pci.h
>> >> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >> >         /* keep track of device state */
>> >> >         unsigned int    is_added:1;
>> >> > +       unsigned int    is_gone:1;
>> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> >> > Index: linux-pm/drivers/pci/remove.c
>> >> > ===================================================================
>> >> > --- linux-pm.orig/drivers/pci/remove.c
>> >> > +++ linux-pm/drivers/pci/remove.c
>> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >> >
>> >> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >> >  {
>> >> > +       dev->is_gone = 1;
>> >> >         device_del(&dev->dev);
>> >> >
>> >> >         down_write(&pci_bus_sem);
>> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >> >   */
>> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >> >  {
>> >> > -       pci_stop_bus_device(dev);
>> >> > -       pci_remove_bus_device(dev);
>> >> > +       if (!dev->is_gone) {
>> >> > +               pci_stop_bus_device(dev);
>> >> > +               pci_remove_bus_device(dev);
>> >> > +       }
>> >> >  }
>> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >> >
>> >>
>> >> Yes, above change should address sys double remove problem.
>> >
>> > I've just realized that we don't need a new flag for that, though.
>> >
>> > It looks like we only need to check dev->dev.kobj.parent and return if that is
>> > NULL, because that means pci_destroy_dev() has run for that device already
>> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
>> > it should do that?).
>> >
>> > Of course, that still is going to be racy if we don't hold
>> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
>> > path using it (or use another similar synchronization mechanism).
>>
>> Wonder if we can have safe way to check if device_del() is called already.
>
> Nope.
>
>> And those access_after_free should be addressed by driver core instead
>> of pci code?
>
> Nope, it's up to the bus to handle this.  It shouldn't be hard, you
> shouldn't actually care about this, if you do, something is wrong.
>
> How is this PCI code so hard to get right?  Look at USB for devices that
> disappear from anywhere at anytime as an example for how to handle
> this.  PCI should be doing the same thing, no need for this "is_gone"
> stuff.
Greg,

  I don't agree that USB is a good example to follow. Have you never hit a
panic when pulling out a USB device from anywhere at any time without
unmounting it or stopping it via a command? That is not the truth. The truth
is that nobody regards it as an enterprise-level interface for attaching
devices.
  Is there a feature for a USB disk to tell the host that you want to pull
it out, so that the host should sync all the data in the cache, unmount the
file system, and then power it off?
  What could USB drive for us? A 40Gb NIC? InfiniBand? A high-end graphics
card?

Thanks,
Ethan
>
> greg k-h

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-09 15:24                                     ` Ethan Zhao
@ 2013-12-09 19:08                                       ` Greg Kroah-Hartman
  2013-12-10  7:43                                         ` Ethan Zhao
  0 siblings, 1 reply; 69+ messages in thread
From: Greg Kroah-Hartman @ 2013-12-09 19:08 UTC (permalink / raw)
  To: Ethan Zhao
  Cc: Yinghai Lu, Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki,
	Gu Zheng, Guo Chao, linux-pci, linux-kernel, Mika Westerberg,
	Myron Stowe

On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote:
> On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
> <gregkh@linuxfoundation.org> wrote:
> > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
> >> [+ GregKH]
> >>
> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> >> >> >
> >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> >> >> >   for a device and its parent bridge via remove_callback().
> >> >> >
> >> >> >   In that case both code paths attempt to acquire
> >> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
> >> >> >   it first, there will be no problems.  However, if the parent
> >> >> >   bridge removal acquires it first, it will eventually execute
> >> >> >   pci_destroy_dev() for the child device, but that device will
> >> >> >   not be freed yet due to the reference held by the concurrent
> >> >> >   child removal.  Consequently, both pci_stop_bus_device() and
> >> >> >   pci_remove_bus_device() will be executed for that device
> >> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
> >> >> >   head in that object.  Moreover, an excess put_device() will
> >> >> >   be executed for that device in that case which may lead to a
> >> >> >   use-after-free in the final kobject_put() done by
> >> >> >   sysfs_schedule_callback_work().
> >> >> >
> >> >> > Index: linux-pm/include/linux/pci.h
> >> >> > ===================================================================
> >> >> > --- linux-pm.orig/include/linux/pci.h
> >> >> > +++ linux-pm/include/linux/pci.h
> >> >> > @@ -321,6 +321,7 @@ struct pci_dev {
> >> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
> >> >> >         /* keep track of device state */
> >> >> >         unsigned int    is_added:1;
> >> >> > +       unsigned int    is_gone:1;
> >> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
> >> >> >         unsigned int    no_msi:1;       /* device may not use msi */
> >> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> >> >> > Index: linux-pm/drivers/pci/remove.c
> >> >> > ===================================================================
> >> >> > --- linux-pm.orig/drivers/pci/remove.c
> >> >> > +++ linux-pm/drivers/pci/remove.c
> >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> >> >> >
> >> >> >  static void pci_destroy_dev(struct pci_dev *dev)
> >> >> >  {
> >> >> > +       dev->is_gone = 1;
> >> >> >         device_del(&dev->dev);
> >> >> >
> >> >> >         down_write(&pci_bus_sem);
> >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> >> >> >   */
> >> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> >> >> >  {
> >> >> > -       pci_stop_bus_device(dev);
> >> >> > -       pci_remove_bus_device(dev);
> >> >> > +       if (!dev->is_gone) {
> >> >> > +               pci_stop_bus_device(dev);
> >> >> > +               pci_remove_bus_device(dev);
> >> >> > +       }
> >> >> >  }
> >> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> >> >> >
> >> >>
> >> >> Yes, above change should address sys double remove problem.
> >> >
> >> > I've just realized that we don't need a new flag for that, though.
> >> >
> >> > It looks like we only need to check dev->dev.kobj.parent and return if that is
> >> > NULL, because that means pci_destroy_dev() has run for that device already
> >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> >> > it should do that?).
> >> >
> >> > Of course, that still is going to be racy if we don't hold
> >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> >> > path using it (or use another similar synchronization mechanism).
> >>
> >> Wonder if we can have safe way to check if device_del() is called already.
> >
> > Nope.
> >
> >> And those access_after_free should be addressed by driver core instead
> >> of pci code?
> >
> > Nope, it's up to the bus to handle this.  It shouldn't be hard, you
> > shouldn't actually care about this, if you do, something is wrong.
> >
> > How is this PCI code so hard to get right?  Look at USB for devices that
> > disappear from anywhere at anytime as an example for how to handle
> > this.  PCI should be doing the same thing, no need for this "is_gone"
> > stuff.
> Greg,
> 
>   Don't agree USB is a good example to follow, do you never hit panic
> when you pull out USB device from anywhere at anytime without unmount
> or stop it via command ?

You shouldn't.  If you do, it's a bug, let us know and we will fix it.

> that is not truth.  the truth is none regards it as enterprise level
> interface to attach devices.

Huh?

>   Is there a feature for an USB disk to tell the host you want to pull
> out it and should sync all the data in cache and unmount the files
> system then power it off ?

Nope, neither is there one for when I yank out my PCI storage device
without telling the OS about it either.  Everything better "just work",
with the exception of any lost data that might be in flight.

>   What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ?

I don't understand what that means at all.

We have USB network ethernet devices, I have a USB 3.0 one here that
works really well.  Infiniband is merely a transport, with some "verbs"
on top of it, that has nothing to do with PCI other than you can have a
IB PCI controller in the system.  And I have a USB graphics adapter here
that works just fine as well (people chain lots of them on one system.)

So how does this apply to PCI at all?  It's the same thing, you have to
be able to handle a PCI device going away at any point in time, with or
without telling the OS ahead of time that you are going to remove it.

greg k-h

* Re: [PATCH v2 04/10] PCI: Destroy pci dev only once
  2013-12-09 19:08                                       ` Greg Kroah-Hartman
@ 2013-12-10  7:43                                         ` Ethan Zhao
  0 siblings, 0 replies; 69+ messages in thread
From: Ethan Zhao @ 2013-12-10  7:43 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Yinghai Lu, Rafael J. Wysocki, Bjorn Helgaas, Rafael J. Wysocki,
	Gu Zheng, Guo Chao, linux-pci, linux-kernel, Mika Westerberg,
	Myron Stowe

On Tue, Dec 10, 2013 at 3:08 AM, Greg Kroah-Hartman
<gregkh@linuxfoundation.org> wrote:
> On Mon, Dec 09, 2013 at 11:24:04PM +0800, Ethan Zhao wrote:
>> On Sun, Dec 8, 2013 at 11:50 AM, Greg Kroah-Hartman
>> <gregkh@linuxfoundation.org> wrote:
>> > On Sat, Dec 07, 2013 at 07:31:21PM -0800, Yinghai Lu wrote:
>> >> [+ GregKH]
>> >>
>> >> On Fri, Dec 6, 2013 at 5:27 PM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> > On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
>> >> >> On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
>> >> >> >
>> >> >> > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
>> >> >> >   for a device and its parent bridge via remove_callback().
>> >> >> >
>> >> >> >   In that case both code paths attempt to acquire
>> >> >> >   pci_remove_rescan_mutex.  If the child device removal acquires
>> >> >> >   it first, there will be no problems.  However, if the parent
>> >> >> >   bridge removal acquires it first, it will eventually execute
>> >> >> >   pci_destroy_dev() for the child device, but that device will
>> >> >> >   not be freed yet due to the reference held by the concurrent
>> >> >> >   child removal.  Consequently, both pci_stop_bus_device() and
>> >> >> >   pci_remove_bus_device() will be executed for that device
>> >> >> >   unnecessarily and pci_destroy_dev() will see a corrupted list
>> >> >> >   head in that object.  Moreover, an excess put_device() will
>> >> >> >   be executed for that device in that case which may lead to a
>> >> >> >   use-after-free in the final kobject_put() done by
>> >> >> >   sysfs_schedule_callback_work().
>> >> >> >
>> >> >> > Index: linux-pm/include/linux/pci.h
>> >> >> > ===================================================================
>> >> >> > --- linux-pm.orig/include/linux/pci.h
>> >> >> > +++ linux-pm/include/linux/pci.h
>> >> >> > @@ -321,6 +321,7 @@ struct pci_dev {
>> >> >> >         unsigned int    multifunction:1;/* Part of multi-function device */
>> >> >> >         /* keep track of device state */
>> >> >> >         unsigned int    is_added:1;
>> >> >> > +       unsigned int    is_gone:1;
>> >> >> >         unsigned int    is_busmaster:1; /* device is busmaster */
>> >> >> >         unsigned int    no_msi:1;       /* device may not use msi */
>> >> >> >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
>> >> >> > Index: linux-pm/drivers/pci/remove.c
>> >> >> > ===================================================================
>> >> >> > --- linux-pm.orig/drivers/pci/remove.c
>> >> >> > +++ linux-pm/drivers/pci/remove.c
>> >> >> > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
>> >> >> >
>> >> >> >  static void pci_destroy_dev(struct pci_dev *dev)
>> >> >> >  {
>> >> >> > +       dev->is_gone = 1;
>> >> >> >         device_del(&dev->dev);
>> >> >> >
>> >> >> >         down_write(&pci_bus_sem);
>> >> >> > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
>> >> >> >   */
>> >> >> >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
>> >> >> >  {
>> >> >> > -       pci_stop_bus_device(dev);
>> >> >> > -       pci_remove_bus_device(dev);
>> >> >> > +       if (!dev->is_gone) {
>> >> >> > +               pci_stop_bus_device(dev);
>> >> >> > +               pci_remove_bus_device(dev);
>> >> >> > +       }
>> >> >> >  }
>> >> >> >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
>> >> >> >
>> >> >>
>> >> >> Yes, above change should address sys double remove problem.
>> >> >
>> >> > I've just realized that we don't need a new flag for that, though.
>> >> >
>> >> > It looks like we only need to check dev->dev.kobj.parent and return if that is
>> >> > NULL, because that means pci_destroy_dev() has run for that device already
>> >> > (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
>> >> > it should do that?).
>> >> >
>> >> > Of course, that still is going to be racy if we don't hold
>> >> > pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
>> >> > path using it (or use another similar synchronization mechanism).
>> >>
>> >> Wonder if we can have safe way to check if device_del() is called already.
>> >
>> > Nope.
>> >
>> >> And those access_after_free should be addressed by driver core instead
>> >> of pci code?
>> >
>> > Nope, it's up to the bus to handle this.  It shouldn't be hard, you
>> > shouldn't actually care about this, if you do, something is wrong.
>> >
>> > How is this PCI code so hard to get right?  Look at USB for devices that
>> > disappear from anywhere at anytime as an example for how to handle
>> > this.  PCI should be doing the same thing, no need for this "is_gone"
>> > stuff.
>> Greg,
>>
>>   Don't agree USB is a good example to follow, do you never hit panic
>> when you pull out USB device from anywhere at anytime without unmount
>> or stop it via command ?
>
> You shouldn't.  If you do, it's a bug, let us know and we will fix it.

Of course. Next time I hit one, I'll bother you with a call trace.

>
>> that is not truth.  the truth is none regards it as enterprise level
>> interface to attach devices.
>
> Huh?

USB 3.0 is still not fast enough for enterprise-level use.
>
>>   Is there a feature for an USB disk to tell the host you want to pull
>> out it and should sync all the data in cache and unmount the files
>> system then power it off ?
>
> Nope, neither is there one for when I yank out my PCI storage device
> without telling the OS about it either.  Everything better "just work",
> with the exception of any lost data that might be in flight.

On a desktop, you do have the option to issue 'sync' and 'umount' and then
pull out the device, and it 'just works'.
On a server, nobody would stand for any data lost in flight.  USB would need
an additional feature so that you can tell udev to sync data, etc., without
a console at hand.

>
>>   What USB could drive for us ? 40GB nic ? infiniband ? High end graphic card ?
>
> I don't understand what that means at all.

Don't be sleepy, man. You know USB is not powerful enough today. Just as you
said, someday all the outdated things will go away, just like ISA and VESA,
and we won't care about the low-level data link layer anymore. But today,
PCIe is still a little more complex than USB to handle.

Thanks,
Ethan
>
> We have USB network ethernet devices, I have a USB 3.0 one here that
> works really well.  Infiniband is merely a transport, with some "verbs"
> on top of it, that has nothing to do with PCI other than you can have a
> IB PCI controller in the system.  And I have a USB graphics adapter here
> that works just fine as well (people chain lots of them on one system.)
>
> So how does this apply to PCI at all?  It's the same thing, you have to
> be able to handle a PCI device going away at any point in time, with or
> without telling the OS ahead of time that you are going to remove it.
>
> greg k-h

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-11-26 19:35     ` Yinghai Lu
@ 2013-12-11 18:48       ` Bjorn Helgaas
  2013-12-11 19:58         ` Yinghai Lu
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-12-11 18:48 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie

On Tue, Nov 26, 2013 at 12:35 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Nov 25, 2013 at 7:46 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>> On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>>> That bar could be 64bit pref mem and above 4G.
>>>
>>> -v2: refresh to 3.13-rc1
>>>
>>> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
>>> Cc: David Airlie <airlied@linux.ie>
>>> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
>>
>> This looks OK to me.  Does it depend on any previous patches in this
>> series?  If not, I think Dave should pick it up.
>
> No.
>
> could be exposed after 5-9 get applied.

Doesn't that mean that this series should be reordered so this patch
is *before* patches 5-9 to avoid bisection breakage?  I know you've
split up and reposted parts of the series, but the question still
applies.

Bjorn

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-12-11 18:48       ` Bjorn Helgaas
@ 2013-12-11 19:58         ` Yinghai Lu
  0 siblings, 0 replies; 69+ messages in thread
From: Yinghai Lu @ 2013-12-11 19:58 UTC (permalink / raw)
  To: Bjorn Helgaas, David Airlie
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel

On Wed, Dec 11, 2013 at 10:48 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>
> Doesn't that mean that this series should be reordered so this patch
> is *before* patches 5-9 to avoid bisection breakage?  I know you've
> split up and reposted parts of the series, but the question still
> applies.

Yes, you are right.

Maybe we could have this one go in via Dave Airlie first, for 3.13?

Thanks

Yinghai

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-11-26  1:28 ` [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr Yinghai Lu
  2013-11-26  3:46   ` Bjorn Helgaas
@ 2013-12-21  0:27   ` Bjorn Helgaas
  2013-12-21  1:19     ` Yinghai Lu
  1 sibling, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-12-21  0:27 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie

On Mon, Nov 25, 2013 at 6:28 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> That bar could be 64bit pref mem and above 4G.
>
> -v2: refresh to 3.13-rc1
>
> Signed-off-by: Yinghai Lu <yinghai@kernel.org>
> Cc: David Airlie <airlied@linux.ie>
> Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> ---
>  drivers/char/agp/intel-gtt.c | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index b8e2014..b929e9d 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -609,8 +609,10 @@ static bool intel_gtt_can_wc(void)
>  static int intel_gtt_init(void)
>  {
>         u32 gma_addr;
> +       u32 addr_hi = 0;
>         u32 gtt_map_size;
>         int ret;
> +       int pos;
>
>         ret = intel_private.driver->setup();
>         if (ret != 0)
> @@ -660,13 +662,17 @@ static int intel_gtt_init(void)
>         }
>
>         if (INTEL_GTT_GEN <= 2)
> -               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
> -                                     &gma_addr);
> +               pos = I810_GMADDR;
>         else
> -               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
> -                                     &gma_addr);
> +               pos = I915_GMADDR;
> +
> +       pci_read_config_dword(intel_private.pcidev, pos, &gma_addr);
> +
> +       if (gma_addr & PCI_BASE_ADDRESS_MEM_TYPE_64)
> +               pci_read_config_dword(intel_private.pcidev, pos + 4, &addr_hi);

Why are we reading these BARs directly anyway?  These look like
standard PCI BARs (I810_GMADDR == 0x10 and I915_GMADDR == 0x18), so
the PCI core should already be reading them correctly, shouldn't it?
Can't we just use pcibios_resource_to_bus(pci_resource_start())?

It looks like i810_setup(), i830_setup(), and i9xx_setup() have the
same problem and should also be using pci_resource_start() or
something similar.

And I'm confused because the i915_gmch_probe() path fills in
gtt->mappable_base with the bus address, but the gen6_gmch_probe()
path uses the resource, i.e., the CPU, address.  That looks broken to
me.

agp_serverworks_probe() looks like another place that should not be
reading the BARs directly.

>
>         intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
> +       intel_private.gma_bus_addr |= (u64)addr_hi << 32;
>
>         return 0;
>  }
> --
> 1.8.1.4
>
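
For reference, the decoding done by the quoted hunk can be sketched in plain
userspace C.  This is only an illustration, not the driver code:
bar_decode_mem() and the BAR_* macros are hypothetical local names, with
values assumed to mirror PCI_BASE_ADDRESS_MEM_TYPE_MASK,
PCI_BASE_ADDRESS_MEM_TYPE_64 and PCI_BASE_ADDRESS_MEM_MASK from pci_regs.h.
It takes the two dwords as they would be read from config space at pos and
pos + 4.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins (assumed to match the kernel's pci_regs.h values). */
#define BAR_SPACE_IO		0x01u	/* bit 0 set: I/O BAR, not memory */
#define BAR_MEM_TYPE_MASK	0x06u	/* bits 2:1 encode the memory type */
#define BAR_MEM_TYPE_64		0x04u	/* 64-bit BAR, uses the next dword too */
#define BAR_MEM_MASK		(~(uint32_t)0x0fu)	/* strips the flag bits */

/*
 * Hypothetical helper: combine the low and high config dwords of a memory
 * BAR into one bus address.  The high dword is only used when the low
 * dword says the BAR is 64 bits wide.
 */
uint64_t bar_decode_mem(uint32_t lo, uint32_t hi)
{
	uint64_t addr;

	/* This sketch handles memory BARs only. */
	assert(!(lo & BAR_SPACE_IO));

	addr = lo & BAR_MEM_MASK;
	if ((lo & BAR_MEM_TYPE_MASK) == BAR_MEM_TYPE_64)
		addr |= (uint64_t)hi << 32;

	return addr;
}
```

For example, a low dword of 0xc000000c with a high dword of 0x1 decodes to
0x1c0000000; a u32-only read as in the old code would have kept just
0xc0000000.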

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-12-21  0:27   ` Bjorn Helgaas
@ 2013-12-21  1:19     ` Yinghai Lu
  2013-12-21 18:50       ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Yinghai Lu @ 2013-12-21  1:19 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie

[-- Attachment #1: Type: text/plain, Size: 1749 bytes --]

On Fri, Dec 20, 2013 at 4:27 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:

> Why are we reading these BARs directly anyway?  These look like
> standard PCI BARs (I810_GMADDR == 0x10 and I915_GMADDR == 0x18), so
> the PCI core should already be reading them correctly, shouldn't it?
> Can't we just use pcibios_resource_to_bus(pci_resource_start())?
>
> It looks like i810_setup(), i830_setup(), and i9xx_setup() have the
> same problem and should also be using pci_resource_start() or
> something similar.

Agreed.

It should be something like:


Index: linux-2.6/drivers/char/agp/intel-gtt.c
===================================================================
--- linux-2.6.orig/drivers/char/agp/intel-gtt.c
+++ linux-2.6/drivers/char/agp/intel-gtt.c
@@ -608,9 +608,9 @@ static bool intel_gtt_can_wc(void)

 static int intel_gtt_init(void)
 {
-       u32 gma_addr;
+       struct pci_bus_region r;
        u32 gtt_map_size;
-       int ret;
+       int ret, idx;

        ret = intel_private.driver->setup();
        if (ret != 0)
@@ -660,13 +660,14 @@ static int intel_gtt_init(void)
        }

        if (INTEL_GTT_GEN <= 2)
-               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
-                                     &gma_addr);
+               idx = 0; /* I810_GMADDR */
        else
-               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
-                                     &gma_addr);
+               idx = 2; /* I915_GMADDR */

-       intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
+       pcibios_resource_to_bus(intel_private.pcidev->bus, &r,
+                               &intel_private.pcidev->resource[idx]);
+
+       intel_private.gma_bus_addr = r.start;

        return 0;
 }

[-- Attachment #2: intel_gma_bus_addr_resource_bus.patch --]
[-- Type: text/x-patch, Size: 965 bytes --]

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index b8e2014..0250017 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -608,9 +608,9 @@ static bool intel_gtt_can_wc(void)
 
 static int intel_gtt_init(void)
 {
-	u32 gma_addr;
+	struct pci_bus_region r;
 	u32 gtt_map_size;
-	int ret;
+	int ret, idx;
 
 	ret = intel_private.driver->setup();
 	if (ret != 0)
@@ -660,13 +660,14 @@ static int intel_gtt_init(void)
 	}
 
 	if (INTEL_GTT_GEN <= 2)
-		pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
-				      &gma_addr);
+		idx = 0; /* I810_GMADDR */
 	else
-		pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
-				      &gma_addr);
+		idx = 2; /* I915_GMADDR */
+
+	pcibios_resource_to_bus(intel_private.pcidev->bus, &r,
+				&intel_private.pcidev->resource[idx]);
 
-	intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
+	intel_private.gma_bus_addr = r.start;
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-12-21  1:19     ` Yinghai Lu
@ 2013-12-21 18:50       ` Bjorn Helgaas
  2013-12-23 22:33         ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Bjorn Helgaas @ 2013-12-21 18:50 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie, Daniel Vetter

[+cc Daniel]

On Fri, Dec 20, 2013 at 05:19:38PM -0800, Yinghai Lu wrote:
> On Fri, Dec 20, 2013 at 4:27 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> 
> > Why are we reading these BARs directly anyway?  These look like
> > standard PCI BARs (I810_GMADDR == 0x10 and I915_GMADDR == 0x18), so
> > the PCI core should already be reading them correctly, shouldn't it?
> > Can't we just use pcibios_resource_to_bus(pci_resource_start())?
> >
> > It looks like i810_setup(), i830_setup(), and i9xx_setup() have the
> > same problem and should also be using pci_resource_start() or
> > something similar.
> 
> Agreed.
> 
> It should be something like:
> 
> 
> Index: linux-2.6/drivers/char/agp/intel-gtt.c
> ===================================================================
> --- linux-2.6.orig/drivers/char/agp/intel-gtt.c
> +++ linux-2.6/drivers/char/agp/intel-gtt.c
> @@ -608,9 +608,9 @@ static bool intel_gtt_can_wc(void)
> 
>  static int intel_gtt_init(void)
>  {
> -       u32 gma_addr;
> +       struct pci_bus_region r;
>         u32 gtt_map_size;
> -       int ret;
> +       int ret, idx;
> 
>         ret = intel_private.driver->setup();
>         if (ret != 0)
> @@ -660,13 +660,14 @@ static int intel_gtt_init(void)
>         }
> 
>         if (INTEL_GTT_GEN <= 2)
> -               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
> -                                     &gma_addr);
> +               idx = 0; /* I810_GMADDR */
>         else
> -               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
> -                                     &gma_addr);
> +               idx = 2; /* I915_GMADDR */
> 
> -       intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
> +       pcibios_resource_to_bus(intel_private.pcidev->bus, &r,
> +                               &intel_private.pcidev->resource[idx]);
> +
> +       intel_private.gma_bus_addr = r.start;
> 
>         return 0;
>  }

I think it's even worse than we first thought.  Not only does the current
code fail on 64-bit BARs, but it also ignores the CPU/bus address
difference, and I think we want the CPU address.  So I think we need
something like these:


commit 8ba262d78f9d218672486c62ba6a1c7a073bd272
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Sat Dec 21 10:49:58 2013 -0700

    agp/intel: Rename "*_bus_addr" to "*_phys_addr"
    
    We're dealing with CPU physical addresses here, which may be different from
    bus addresses, so rename "gtt_bus_addr" to "gtt_phys_addr" and
    "gma_bus_addr" to "gma_phys_addr" to avoid confusion.
    
    No functional change.
    
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index b8e2014cb9cb..dca04a6aa7f8 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -64,7 +64,7 @@ static struct _intel_private {
 	struct pci_dev *pcidev;	/* device one */
 	struct pci_dev *bridge_dev;
 	u8 __iomem *registers;
-	phys_addr_t gtt_bus_addr;
+	phys_addr_t gtt_phys_addr;
 	u32 PGETBL_save;
 	u32 __iomem *gtt;		/* I915G */
 	bool clear_fake_agp; /* on first access via agp, fill with scratch */
@@ -78,7 +78,7 @@ static struct _intel_private {
 	int refcount;
 	/* Whether i915 needs to use the dmar apis or not. */
 	unsigned int needs_dmar : 1;
-	phys_addr_t gma_bus_addr;
+	phys_addr_t gma_phys_addr;
 	/*  Size of memory reserved for graphics by the BIOS */
 	unsigned int stolen_size;
 	/* Total number of gtt entries. */
@@ -191,7 +191,7 @@ static int i810_setup(void)
 	writel(virt_to_phys(gtt_table) | I810_PGETBL_ENABLED,
 	       intel_private.registers+I810_PGETBL_CTL);
 
-	intel_private.gtt_bus_addr = reg_addr + I810_PTE_BASE;
+	intel_private.gtt_phys_addr = reg_addr + I810_PTE_BASE;
 
 	if ((readl(intel_private.registers+I810_DRAM_CTL)
 		& I810_DRAM_ROW_0) == I810_DRAM_ROW_0_SDRAM) {
@@ -636,10 +636,10 @@ static int intel_gtt_init(void)
 
 	intel_private.gtt = NULL;
 	if (intel_gtt_can_wc())
-		intel_private.gtt = ioremap_wc(intel_private.gtt_bus_addr,
+		intel_private.gtt = ioremap_wc(intel_private.gtt_phys_addr,
 					       gtt_map_size);
 	if (intel_private.gtt == NULL)
-		intel_private.gtt = ioremap(intel_private.gtt_bus_addr,
+		intel_private.gtt = ioremap(intel_private.gtt_phys_addr,
 					    gtt_map_size);
 	if (intel_private.gtt == NULL) {
 		intel_private.driver->cleanup();
@@ -666,7 +666,7 @@ static int intel_gtt_init(void)
 		pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
 				      &gma_addr);
 
-	intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
+	intel_private.gma_phys_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
 
 	return 0;
 }
@@ -796,7 +796,7 @@ static int i830_setup(void)
 	if (!intel_private.registers)
 		return -ENOMEM;
 
-	intel_private.gtt_bus_addr = reg_addr + I810_PTE_BASE;
+	intel_private.gtt_phys_addr = reg_addr + I810_PTE_BASE;
 
 	return 0;
 }
@@ -821,7 +821,7 @@ static int intel_fake_agp_configure(void)
 	    return -EIO;
 
 	intel_private.clear_fake_agp = true;
-	agp_bridge->gart_bus_addr = intel_private.gma_bus_addr;
+	agp_bridge->gart_bus_addr = intel_private.gma_phys_addr;
 
 	return 0;
 }
@@ -1123,13 +1123,13 @@ static int i9xx_setup(void)
 	case 3:
 		pci_read_config_dword(intel_private.pcidev,
 				      I915_PTEADDR, &gtt_addr);
-		intel_private.gtt_bus_addr = gtt_addr;
+		intel_private.gtt_phys_addr = gtt_addr;
 		break;
 	case 5:
-		intel_private.gtt_bus_addr = reg_addr + MB(2);
+		intel_private.gtt_phys_addr = reg_addr + MB(2);
 		break;
 	default:
-		intel_private.gtt_bus_addr = reg_addr + KB(512);
+		intel_private.gtt_phys_addr = reg_addr + KB(512);
 		break;
 	}
 
@@ -1409,7 +1409,7 @@ void intel_gtt_get(size_t *gtt_total, size_t *stolen_size,
 {
 	*gtt_total = intel_private.gtt_total_entries << PAGE_SHIFT;
 	*stolen_size = intel_private.stolen_size;
-	*mappable_base = intel_private.gma_bus_addr;
+	*mappable_base = intel_private.gma_phys_addr;
 	*mappable_end = intel_private.gtt_mappable_entries << PAGE_SHIFT;
 }
 EXPORT_SYMBOL(intel_gtt_get);

commit 31349de4ce32a0c2c2d14df35717544e94d56066
Author: Bjorn Helgaas <bhelgaas@google.com>
Date:   Sat Dec 21 10:09:19 2013 -0700

    intel-gtt: Use CPU address (not BAR value) for GART
    
    There were two problems here:
    
      1) The GMADR can be either a 32-bit or a 64-bit BAR, but we only read the
         low 32 bits, so this failed if GMADR was above 4GB.
    
      2) The value read from the BAR is a bus address, not a CPU physical
         address, and these may be different.
    
    Use pci_resource_start() instead of reading the BAR directly to remove the
    BAR size and bus/CPU address issue, as gen8_gmch_probe() already does.
    
    Reference: http://lkml.kernel.org/r/1385429290-25397-11-git-send-email-yinghai@kernel.org
    Based-on-patch-by: Yinghai Lu <yinghai@kernel.org>
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index dca04a6aa7f8..35169ff0ffe1 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -610,7 +610,7 @@ static int intel_gtt_init(void)
 {
 	u32 gma_addr;
 	u32 gtt_map_size;
-	int ret;
+	int ret, bar;
 
 	ret = intel_private.driver->setup();
 	if (ret != 0)
@@ -660,14 +660,12 @@ static int intel_gtt_init(void)
 	}
 
 	if (INTEL_GTT_GEN <= 2)
-		pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
-				      &gma_addr);
+		bar = 0;	/* I810_GMADDR */
 	else
-		pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
-				      &gma_addr);
-
-	intel_private.gma_phys_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
+		bar = 2;	/* I915_GMADDR */
 
+	intel_private.gma_phys_addr = pci_resource_start(intel_private.pcidev,
+							 bar);
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 69+ messages in thread

* Re: [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr
  2013-12-21 18:50       ` Bjorn Helgaas
@ 2013-12-23 22:33         ` Bjorn Helgaas
  0 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2013-12-23 22:33 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci, linux-kernel,
	David Airlie, Daniel Vetter

On Sat, Dec 21, 2013 at 11:50 AM, Bjorn Helgaas <bhelgaas@google.com> wrote:
> [+cc Daniel]
>
> On Fri, Dec 20, 2013 at 05:19:38PM -0800, Yinghai Lu wrote:
>> On Fri, Dec 20, 2013 at 4:27 PM, Bjorn Helgaas <bhelgaas@google.com> wrote:
>>
>> > Why are we reading these BARs directly anyway?  These look like
>> > standard PCI BARs (I810_GMADDR == 0x10 and I915_GMADDR == 0x18), so
>> > the PCI core should already be reading them correctly, shouldn't it?
>> > Can't we just use pcibios_resource_to_bus(pci_resource_start())?
>> >
>> > It looks like i810_setup(), i830_setup(), and i9xx_setup() have the
>> > same problem and should also be using pci_resource_start() or
>> > something similar.
>>
>> Agreed.
>>
>> It should be something like:
>>
>>
>> Index: linux-2.6/drivers/char/agp/intel-gtt.c
>> ===================================================================
>> --- linux-2.6.orig/drivers/char/agp/intel-gtt.c
>> +++ linux-2.6/drivers/char/agp/intel-gtt.c
>> @@ -608,9 +608,9 @@ static bool intel_gtt_can_wc(void)
>>
>>  static int intel_gtt_init(void)
>>  {
>> -       u32 gma_addr;
>> +       struct pci_bus_region r;
>>         u32 gtt_map_size;
>> -       int ret;
>> +       int ret, idx;
>>
>>         ret = intel_private.driver->setup();
>>         if (ret != 0)
>> @@ -660,13 +660,14 @@ static int intel_gtt_init(void)
>>         }
>>
>>         if (INTEL_GTT_GEN <= 2)
>> -               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
>> -                                     &gma_addr);
>> +               idx = 0; /* I810_GMADDR */
>>         else
>> -               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
>> -                                     &gma_addr);
>> +               idx = 2; /* I915_GMADDR */
>>
>> -       intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
>> +       pcibios_resource_to_bus(intel_private.pcidev->bus, &r,
>> +                               &intel_private.pcidev->resource[idx]);
>> +
>> +       intel_private.gma_bus_addr = r.start;
>>
>>         return 0;
>>  }
>
> I think it's even worse than we first thought.  Not only does the current
> code fail on 64-bit BARs, but it also ignores the CPU/bus address
> difference, and I think we want the CPU address.  So I think we need
> something like these:

Please ignore the following patches.  After looking again, I think
"gma_bus_addr" really is a bus address after all, and something like
Yinghai's patch is more appropriate.  But I think there are other
places with similar issues, so I'm working on a more extensive set of
patches.

Bjorn
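
To make the CPU/bus distinction concrete, here is a rough userspace sketch
of the kind of translation pcibios_resource_to_bus() performs through a host
bridge window.  The struct and function names are made up for illustration,
and a single window with a fixed offset is assumed; the real code walks the
host bridge's window list.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one host bridge window. */
struct bridge_window {
	uint64_t cpu_start;	/* first CPU physical address in the window */
	uint64_t cpu_end;	/* last CPU physical address (inclusive) */
	uint64_t offset;	/* CPU address minus bus address */
};

/* CPU physical address -> PCI bus address; identity outside the window. */
uint64_t cpu_to_bus(const struct bridge_window *w, uint64_t cpu)
{
	if (cpu >= w->cpu_start && cpu <= w->cpu_end)
		return cpu - w->offset;
	return cpu;
}

/* PCI bus address -> CPU physical address; identity outside the window. */
uint64_t bus_to_cpu(const struct bridge_window *w, uint64_t bus)
{
	uint64_t bus_start = w->cpu_start - w->offset;
	uint64_t bus_end = w->cpu_end - w->offset;

	if (bus >= bus_start && bus <= bus_end)
		return bus + w->offset;
	return bus;
}
```

With a zero offset, which is the typical x86 case, both functions return
their input unchanged.  With a window starting at CPU address 0x100000000
and an offset of 0x100000000, CPU address 0x1c0000000 corresponds to bus
address 0xc0000000, which is why a raw BAR value and pci_resource_start()
may disagree.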

> commit 8ba262d78f9d218672486c62ba6a1c7a073bd272
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Sat Dec 21 10:49:58 2013 -0700
>
>     agp/intel: Rename "*_bus_addr" to "*_phys_addr"
>
>     We're dealing with CPU physical addresses here, which may be different from
>     bus addresses, so rename "gtt_bus_addr" to "gtt_phys_addr" and
>     "gma_bus_addr" to "gma_phys_addr" to avoid confusion.
>
>     No functional change.
>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index b8e2014cb9cb..dca04a6aa7f8 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -64,7 +64,7 @@ static struct _intel_private {
>         struct pci_dev *pcidev; /* device one */
>         struct pci_dev *bridge_dev;
>         u8 __iomem *registers;
> -       phys_addr_t gtt_bus_addr;
> +       phys_addr_t gtt_phys_addr;
>         u32 PGETBL_save;
>         u32 __iomem *gtt;               /* I915G */
>         bool clear_fake_agp; /* on first access via agp, fill with scratch */
> @@ -78,7 +78,7 @@ static struct _intel_private {
>         int refcount;
>         /* Whether i915 needs to use the dmar apis or not. */
>         unsigned int needs_dmar : 1;
> -       phys_addr_t gma_bus_addr;
> +       phys_addr_t gma_phys_addr;
>         /*  Size of memory reserved for graphics by the BIOS */
>         unsigned int stolen_size;
>         /* Total number of gtt entries. */
> @@ -191,7 +191,7 @@ static int i810_setup(void)
>         writel(virt_to_phys(gtt_table) | I810_PGETBL_ENABLED,
>                intel_private.registers+I810_PGETBL_CTL);
>
> -       intel_private.gtt_bus_addr = reg_addr + I810_PTE_BASE;
> +       intel_private.gtt_phys_addr = reg_addr + I810_PTE_BASE;
>
>         if ((readl(intel_private.registers+I810_DRAM_CTL)
>                 & I810_DRAM_ROW_0) == I810_DRAM_ROW_0_SDRAM) {
> @@ -636,10 +636,10 @@ static int intel_gtt_init(void)
>
>         intel_private.gtt = NULL;
>         if (intel_gtt_can_wc())
> -               intel_private.gtt = ioremap_wc(intel_private.gtt_bus_addr,
> +               intel_private.gtt = ioremap_wc(intel_private.gtt_phys_addr,
>                                                gtt_map_size);
>         if (intel_private.gtt == NULL)
> -               intel_private.gtt = ioremap(intel_private.gtt_bus_addr,
> +               intel_private.gtt = ioremap(intel_private.gtt_phys_addr,
>                                             gtt_map_size);
>         if (intel_private.gtt == NULL) {
>                 intel_private.driver->cleanup();
> @@ -666,7 +666,7 @@ static int intel_gtt_init(void)
>                 pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
>                                       &gma_addr);
>
> -       intel_private.gma_bus_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
> +       intel_private.gma_phys_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
>
>         return 0;
>  }
> @@ -796,7 +796,7 @@ static int i830_setup(void)
>         if (!intel_private.registers)
>                 return -ENOMEM;
>
> -       intel_private.gtt_bus_addr = reg_addr + I810_PTE_BASE;
> +       intel_private.gtt_phys_addr = reg_addr + I810_PTE_BASE;
>
>         return 0;
>  }
> @@ -821,7 +821,7 @@ static int intel_fake_agp_configure(void)
>             return -EIO;
>
>         intel_private.clear_fake_agp = true;
> -       agp_bridge->gart_bus_addr = intel_private.gma_bus_addr;
> +       agp_bridge->gart_bus_addr = intel_private.gma_phys_addr;
>
>         return 0;
>  }
> @@ -1123,13 +1123,13 @@ static int i9xx_setup(void)
>         case 3:
>                 pci_read_config_dword(intel_private.pcidev,
>                                       I915_PTEADDR, &gtt_addr);
> -               intel_private.gtt_bus_addr = gtt_addr;
> +               intel_private.gtt_phys_addr = gtt_addr;
>                 break;
>         case 5:
> -               intel_private.gtt_bus_addr = reg_addr + MB(2);
> +               intel_private.gtt_phys_addr = reg_addr + MB(2);
>                 break;
>         default:
> -               intel_private.gtt_bus_addr = reg_addr + KB(512);
> +               intel_private.gtt_phys_addr = reg_addr + KB(512);
>                 break;
>         }
>
> @@ -1409,7 +1409,7 @@ void intel_gtt_get(size_t *gtt_total, size_t *stolen_size,
>  {
>         *gtt_total = intel_private.gtt_total_entries << PAGE_SHIFT;
>         *stolen_size = intel_private.stolen_size;
> -       *mappable_base = intel_private.gma_bus_addr;
> +       *mappable_base = intel_private.gma_phys_addr;
>         *mappable_end = intel_private.gtt_mappable_entries << PAGE_SHIFT;
>  }
>  EXPORT_SYMBOL(intel_gtt_get);
>
> commit 31349de4ce32a0c2c2d14df35717544e94d56066
> Author: Bjorn Helgaas <bhelgaas@google.com>
> Date:   Sat Dec 21 10:09:19 2013 -0700
>
>     intel-gtt: Use CPU address (not BAR value) for GART
>
>     There were two problems here:
>
>       1) The GMADR can be either a 32-bit or a 64-bit BAR, but we only read the
>          low 32 bits, so this failed if GMADR was above 4GB.
>
>       2) The value read from the BAR is a bus address, not a CPU physical
>          address, and these may be different.
>
>     Use pci_resource_start() instead of reading the BAR directly to remove the
>     BAR size and bus/CPU address issue, as gen8_gmch_probe() already does.
>
>     Reference: http://lkml.kernel.org/r/1385429290-25397-11-git-send-email-yinghai@kernel.org
>     Based-on-patch-by: Yinghai Lu <yinghai@kernel.org>
>     Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
>
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index dca04a6aa7f8..35169ff0ffe1 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -610,7 +610,7 @@ static int intel_gtt_init(void)
>  {
>         u32 gma_addr;
>         u32 gtt_map_size;
> -       int ret;
> +       int ret, bar;
>
>         ret = intel_private.driver->setup();
>         if (ret != 0)
> @@ -660,14 +660,12 @@ static int intel_gtt_init(void)
>         }
>
>         if (INTEL_GTT_GEN <= 2)
> -               pci_read_config_dword(intel_private.pcidev, I810_GMADDR,
> -                                     &gma_addr);
> +               bar = 0;        /* I810_GMADDR */
>         else
> -               pci_read_config_dword(intel_private.pcidev, I915_GMADDR,
> -                                     &gma_addr);
> -
> -       intel_private.gma_phys_addr = (gma_addr & PCI_BASE_ADDRESS_MEM_MASK);
> +               bar = 2;        /* I915_GMADDR */
>
> +       intel_private.gma_phys_addr = pci_resource_start(intel_private.pcidev,
> +                                                        bar);
>         return 0;
>  }
>

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once)
  2013-12-06  1:21                               ` Rafael J. Wysocki
  2013-12-06  6:29                                 ` Yinghai Lu
@ 2014-01-10 14:20                                 ` Rafael J. Wysocki
  2014-01-10 14:22                                   ` [PATCH 1/9] PCI: Global rescan-remove lock Rafael J. Wysocki
                                                     ` (9 more replies)
  1 sibling, 10 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:20 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

[Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for
 platform/x86 and Konrad for Xen]

On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote:

[...]

> 
> OK
> 
> To be a bit more constructive, as the next step I'd try to use
> pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
> above) without making the other changes made by my patch.  Does that sound
> reasonable?

Well, no answer here, so as a followup, a series implementing that idea
follows.

I *hope* I found all of the places that need to be synchronized vs the bus
rescan and device removal that can be triggered via sysfs, but I might have overlooked
something.  Also in some cases I wasn't quite sure how much stuff to put under
the lock, because said stuff is not exactly straightforward.

Enjoy!

Rafael


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 1/9] PCI: Global rescan-remove lock
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
@ 2014-01-10 14:22                                   ` Rafael J. Wysocki
  2014-01-10 14:23                                   ` [PATCH 2/9] ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug Rafael J. Wysocki
                                                     ` (8 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:22 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

There are multiple PCI device addition and removal code paths that
may be run concurrently with the generic PCI bus rescan and device
removal that can be triggered via sysfs.  If that happens, it may
lead to multiple different, potentially dangerous race conditions.

The most straightforward way to address those problems is to run
the code in question under the same lock that is used by the
generic rescan/remove code in pci-sysfs.c.  To prepare for those
changes, move the definition of the global PCI remove/rescan lock
to probe.c and provide global wrappers, pci_lock_rescan_remove()
and pci_unlock_rescan_remove(), allowing drivers to manipulate
that lock.  Also provide pci_stop_and_remove_bus_device_locked()
for the callers of pci_stop_and_remove_bus_device() who only need
to hold the rescan/remove lock around it.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/pci-sysfs.c |   19 +++++++------------
 drivers/pci/probe.c     |   18 ++++++++++++++++++
 drivers/pci/remove.c    |    8 ++++++++
 include/linux/pci.h     |    3 +++
 4 files changed, 36 insertions(+), 12 deletions(-)

Index: linux-pm/drivers/pci/pci-sysfs.c
===================================================================
--- linux-pm.orig/drivers/pci/pci-sysfs.c
+++ linux-pm/drivers/pci/pci-sysfs.c
@@ -297,7 +297,6 @@ msi_bus_store(struct device *dev, struct
 }
 static DEVICE_ATTR_RW(msi_bus);
 
-static DEFINE_MUTEX(pci_remove_rescan_mutex);
 static ssize_t bus_rescan_store(struct bus_type *bus, const char *buf,
 				size_t count)
 {
@@ -308,10 +307,10 @@ static ssize_t bus_rescan_store(struct b
 		return -EINVAL;
 
 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
+		pci_lock_rescan_remove();
 		while ((b = pci_find_next_bus(b)) != NULL)
 			pci_rescan_bus(b);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		pci_unlock_rescan_remove();
 	}
 	return count;
 }
@@ -342,9 +341,9 @@ dev_rescan_store(struct device *dev, str
 		return -EINVAL;
 
 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
+		pci_lock_rescan_remove();
 		pci_rescan_bus(pdev->bus);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		pci_unlock_rescan_remove();
 	}
 	return count;
 }
@@ -354,11 +353,7 @@ static struct device_attribute dev_resca
 
 static void remove_callback(struct device *dev)
 {
-	struct pci_dev *pdev = to_pci_dev(dev);
-
-	mutex_lock(&pci_remove_rescan_mutex);
-	pci_stop_and_remove_bus_device(pdev);
-	mutex_unlock(&pci_remove_rescan_mutex);
+	pci_stop_and_remove_bus_device_locked(to_pci_dev(dev));
 }
 
 static ssize_t
@@ -395,12 +390,12 @@ dev_bus_rescan_store(struct device *dev,
 		return -EINVAL;
 
 	if (val) {
-		mutex_lock(&pci_remove_rescan_mutex);
+		pci_lock_rescan_remove();
 		if (!pci_is_root_bus(bus) && list_empty(&bus->devices))
 			pci_rescan_bus_bridge_resize(bus->self);
 		else
 			pci_rescan_bus(bus);
-		mutex_unlock(&pci_remove_rescan_mutex);
+		pci_unlock_rescan_remove();
 	}
 	return count;
 }
Index: linux-pm/drivers/pci/probe.c
===================================================================
--- linux-pm.orig/drivers/pci/probe.c
+++ linux-pm/drivers/pci/probe.c
@@ -2014,6 +2014,24 @@ EXPORT_SYMBOL(pci_scan_slot);
 EXPORT_SYMBOL(pci_scan_bridge);
 EXPORT_SYMBOL_GPL(pci_scan_child_bus);
 
+/*
+ * pci_rescan_bus(), pci_rescan_bus_bridge_resize() and PCI device removal
+ * routines should always be executed under this mutex.
+ */
+static DEFINE_MUTEX(pci_rescan_remove_lock);
+
+void pci_lock_rescan_remove(void)
+{
+	mutex_lock(&pci_rescan_remove_lock);
+}
+EXPORT_SYMBOL_GPL(pci_lock_rescan_remove);
+
+void pci_unlock_rescan_remove(void)
+{
+	mutex_unlock(&pci_rescan_remove_lock);
+}
+EXPORT_SYMBOL_GPL(pci_unlock_rescan_remove);
+
 static int __init pci_sort_bf_cmp(const struct device *d_a, const struct device *d_b)
 {
 	const struct pci_dev *a = to_pci_dev(d_a);
Index: linux-pm/include/linux/pci.h
===================================================================
--- linux-pm.orig/include/linux/pci.h
+++ linux-pm/include/linux/pci.h
@@ -779,6 +779,7 @@ struct pci_dev *pci_dev_get(struct pci_d
 void pci_dev_put(struct pci_dev *dev);
 void pci_remove_bus(struct pci_bus *b);
 void pci_stop_and_remove_bus_device(struct pci_dev *dev);
+void pci_stop_and_remove_bus_device_locked(struct pci_dev *dev);
 void pci_stop_root_bus(struct pci_bus *bus);
 void pci_remove_root_bus(struct pci_bus *bus);
 void pci_setup_cardbus(struct pci_bus *bus);
@@ -1022,6 +1023,8 @@ void set_pcie_hotplug_bridge(struct pci_
 int pci_bus_find_capability(struct pci_bus *bus, unsigned int devfn, int cap);
 unsigned int pci_rescan_bus_bridge_resize(struct pci_dev *bridge);
 unsigned int pci_rescan_bus(struct pci_bus *bus);
+void pci_lock_rescan_remove(void);
+void pci_unlock_rescan_remove(void);
 
 /* Vital product data routines */
 ssize_t pci_read_vpd(struct pci_dev *dev, loff_t pos, size_t count, void *buf);
Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -114,6 +114,14 @@ void pci_stop_and_remove_bus_device(stru
 }
 EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
 
+void pci_stop_and_remove_bus_device_locked(struct pci_dev *dev)
+{
+	pci_lock_rescan_remove();
+	pci_stop_and_remove_bus_device(dev);
+	pci_unlock_rescan_remove();
+}
+EXPORT_SYMBOL_GPL(pci_stop_and_remove_bus_device_locked);
+
 void pci_stop_root_bus(struct pci_bus *bus)
 {
 	struct pci_dev *child, *tmp;



* [PATCH 2/9] ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  2014-01-10 14:22                                   ` [PATCH 1/9] PCI: Global rescan-remove lock Rafael J. Wysocki
@ 2014-01-10 14:23                                   ` Rafael J. Wysocki
  2014-01-10 14:24                                   ` [PATCH 3/9] ACPI / hotplug / PCI: Use global PCI rescan-remove locking Rafael J. Wysocki
                                                     ` (7 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:23 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between the addition and
removal of PCI devices during ACPI PCI host bridge hotplug and the
generic PCI bus rescan and device removal that can be triggered via
sysfs.

To avoid those race conditions make the ACPI PCI host bridge addition
and removal code use global PCI rescan-remove locking.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/acpi/pci_root.c |    6 ++++++
 1 file changed, 6 insertions(+)
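
For readers who want the shape of the locking without the ACPI details, here is a minimal userspace sketch. It is not kernel code. It assumes a pthread mutex as a stand-in for pci_rescan_remove_lock, and struct model_root and the model_* functions are hypothetical names made up for this illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-in for the kernel's global pci_rescan_remove_lock. */
pthread_mutex_t model_rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical host bridge: only tracks how many devices are visible. */
struct model_root {
	int devices;
};

/* Models acpi_pci_root_add(): publishing devices happens under the lock. */
int model_root_add(struct model_root *root, int found)
{
	assert(found >= 0);

	pthread_mutex_lock(&model_rescan_remove_lock);
	root->devices += found;
	pthread_mutex_unlock(&model_rescan_remove_lock);
	return 1;
}

/* Models acpi_pci_root_remove(): stop and remove share one critical section. */
void model_root_remove(struct model_root *root)
{
	pthread_mutex_lock(&model_rescan_remove_lock);
	root->devices = 0;	/* everything below the root goes away */
	pthread_mutex_unlock(&model_rescan_remove_lock);
}

/* Models a sysfs-triggered rescan that competes for the same lock. */
int model_sysfs_rescan(struct model_root *root)
{
	int seen;

	pthread_mutex_lock(&model_rescan_remove_lock);
	seen = root->devices;
	pthread_mutex_unlock(&model_rescan_remove_lock);
	return seen;
}
```

Because all three paths take the same mutex, a rescan sees the bus either before or after a removal, never halfway through it.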

Index: linux-pm/drivers/acpi/pci_root.c
===================================================================
--- linux-pm.orig/drivers/acpi/pci_root.c
+++ linux-pm/drivers/acpi/pci_root.c
@@ -604,7 +604,9 @@ static int acpi_pci_root_add(struct acpi
 		pci_assign_unassigned_root_bus_resources(root->bus);
 	}
 
+	pci_lock_rescan_remove();
 	pci_bus_add_devices(root->bus);
+	pci_unlock_rescan_remove();
 	return 1;
 
 end:
@@ -616,6 +618,8 @@ static void acpi_pci_root_remove(struct
 {
 	struct acpi_pci_root *root = acpi_driver_data(device);
 
+	pci_lock_rescan_remove();
+
 	pci_stop_root_bus(root->bus);
 
 	device_set_run_wake(root->bus->bridge, false);
@@ -623,6 +627,8 @@ static void acpi_pci_root_remove(struct
 
 	pci_remove_root_bus(root->bus);
 
+	pci_unlock_rescan_remove();
+
 	kfree(root);
 }
 



* [PATCH 3/9] ACPI / hotplug / PCI: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
  2014-01-10 14:22                                   ` [PATCH 1/9] PCI: Global rescan-remove lock Rafael J. Wysocki
  2014-01-10 14:23                                   ` [PATCH 2/9] ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug Rafael J. Wysocki
@ 2014-01-10 14:24                                   ` Rafael J. Wysocki
  2014-01-10 14:25                                   ` [PATCH 4/9] PCMCIA / cardbus: " Rafael J. Wysocki
                                                     ` (6 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:24 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between the ACPI-based PCI
hotplug (ACPIPHP) and the generic PCI bus rescan and device removal
that can be triggered via sysfs.

To avoid those race conditions make the ACPIPHP code use global PCI
rescan-remove locking.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/hotplug/acpiphp.h      |    5 +++-
 drivers/pci/hotplug/acpiphp_core.c |    2 -
 drivers/pci/hotplug/acpiphp_glue.c |   43 ++++++++++++++++++++++++++++++++-----
 3 files changed, 43 insertions(+), 7 deletions(-)
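
The going-away check only helps if it is made with the global lock held and the lock is dropped again on every exit path. A minimal userspace sketch of that rule follows. It assumes a pthread mutex in place of the rescan-remove lock, and the MODEL_* flags, struct model_slot and the model_* functions are hypothetical names for illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

#define MODEL_SLOT_ENABLED		0x1u
#define MODEL_SLOT_IS_GOING_AWAY	0x2u

/* Hypothetical slot: just a flags word. */
struct model_slot {
	unsigned int flags;
};

/* Userspace stand-in for the global rescan-remove lock. */
pthread_mutex_t model_rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Teardown path: mark the slot so that later users back off. */
void model_mark_going_away(struct model_slot *slot)
{
	pthread_mutex_lock(&model_rescan_remove_lock);
	slot->flags |= MODEL_SLOT_IS_GOING_AWAY;
	pthread_mutex_unlock(&model_rescan_remove_lock);
}

/* Enable path: test the flag under the lock, unlock before bailing out. */
int model_enable_slot(struct model_slot *slot)
{
	assert(slot != NULL);

	pthread_mutex_lock(&model_rescan_remove_lock);

	if (slot->flags & MODEL_SLOT_IS_GOING_AWAY) {
		pthread_mutex_unlock(&model_rescan_remove_lock);
		return -ENODEV;
	}

	slot->flags |= MODEL_SLOT_ENABLED;

	pthread_mutex_unlock(&model_rescan_remove_lock);
	return 0;
}
```

An early return that skipped the unlock would leave the global mutex held, and the next rescan or removal anywhere in the system would block on it.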

Index: linux-pm/drivers/pci/hotplug/acpiphp.h
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp.h
+++ linux-pm/drivers/pci/hotplug/acpiphp.h
@@ -77,6 +77,8 @@ struct acpiphp_bridge {
 
 	/* PCI-to-PCI bridge device */
 	struct pci_dev *pci_dev;
+
+	bool is_going_away;
 };
 
 
@@ -150,6 +152,7 @@ struct acpiphp_attention_info
 /* slot flags */
 
 #define SLOT_ENABLED		(0x00000001)
+#define SLOT_IS_GOING_AWAY	(0x00000002)
 
 /* function flags */
 
@@ -169,7 +172,7 @@ void acpiphp_unregister_hotplug_slot(str
 typedef int (*acpiphp_callback)(struct acpiphp_slot *slot, void *data);
 
 int acpiphp_enable_slot(struct acpiphp_slot *slot);
-int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
+int acpiphp_disable_slot(struct acpiphp_slot *slot);
 u8 acpiphp_get_power_status(struct acpiphp_slot *slot);
 u8 acpiphp_get_attention_status(struct acpiphp_slot *slot);
 u8 acpiphp_get_latch_status(struct acpiphp_slot *slot);
Index: linux-pm/drivers/pci/hotplug/acpiphp_glue.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_glue.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_glue.c
@@ -432,6 +432,7 @@ static void cleanup_bridge(struct acpiph
 					pr_err("failed to remove notify handler\n");
 			}
 		}
+		slot->flags |= SLOT_IS_GOING_AWAY;
 		if (slot->slot)
 			acpiphp_unregister_hotplug_slot(slot);
 	}
@@ -439,6 +440,8 @@ static void cleanup_bridge(struct acpiph
 	mutex_lock(&bridge_mutex);
 	list_del(&bridge->list);
 	mutex_unlock(&bridge_mutex);
+
+	bridge->is_going_away = true;
 }
 
 /**
@@ -757,6 +760,10 @@ static void acpiphp_check_bridge(struct
 {
 	struct acpiphp_slot *slot;
 
+	/* Bail out if the bridge is going away. */
+	if (bridge->is_going_away)
+		return;
+
 	list_for_each_entry(slot, &bridge->slots, node) {
 		struct pci_bus *bus = slot->bus;
 		struct pci_dev *dev, *tmp;
@@ -827,6 +834,8 @@ void acpiphp_check_host_bridge(acpi_hand
 	}
 }
 
+static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot);
+
 static void hotplug_event(acpi_handle handle, u32 type, void *data)
 {
 	struct acpiphp_context *context = data;
@@ -856,6 +865,9 @@ static void hotplug_event(acpi_handle ha
 		} else {
 			struct acpiphp_slot *slot = func->slot;
 
+			if (slot->flags & SLOT_IS_GOING_AWAY)
+				break;
+
 			mutex_lock(&slot->crit_sect);
 			enable_slot(slot);
 			mutex_unlock(&slot->crit_sect);
@@ -871,6 +883,9 @@ static void hotplug_event(acpi_handle ha
 			struct acpiphp_slot *slot = func->slot;
 			int ret;
 
+			if (slot->flags & SLOT_IS_GOING_AWAY)
+				break;
+
 			/*
 			 * Check if anything has changed in the slot and rescan
 			 * from the parent if that's the case.
@@ -900,9 +915,11 @@ static void hotplug_event_work(void *dat
 	acpi_handle handle = context->handle;
 
 	acpi_scan_lock_acquire();
+	pci_lock_rescan_remove();
 
 	hotplug_event(handle, type, context);
 
+	pci_unlock_rescan_remove();
 	acpi_scan_lock_release();
 	acpi_evaluate_hotplug_ost(handle, type, ACPI_OST_SC_SUCCESS, NULL);
 	put_bridge(context->func.parent);
@@ -1070,12 +1087,19 @@ void acpiphp_remove_slots(struct pci_bus
  */
 int acpiphp_enable_slot(struct acpiphp_slot *slot)
 {
+	pci_lock_rescan_remove();
+	if (slot->flags & SLOT_IS_GOING_AWAY) {
+		pci_unlock_rescan_remove();
+		return -ENODEV;
+	}
 	mutex_lock(&slot->crit_sect);
 	/* configure all functions */
 	if (!(slot->flags & SLOT_ENABLED))
 		enable_slot(slot);
 
 	mutex_unlock(&slot->crit_sect);
+
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
@@ -1083,10 +1107,12 @@ int acpiphp_enable_slot(struct acpiphp_s
  * acpiphp_disable_and_eject_slot - power off and eject slot
  * @slot: ACPI PHP slot
  */
-int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
+static int acpiphp_disable_and_eject_slot(struct acpiphp_slot *slot)
 {
 	struct acpiphp_func *func;
-	int retval = 0;
+
+	if (slot->flags & SLOT_IS_GOING_AWAY)
+		return -ENODEV;
 
 	mutex_lock(&slot->crit_sect);
 
@@ -1104,9 +1130,18 @@ int acpiphp_disable_and_eject_slot(struc
 		}
 
 	mutex_unlock(&slot->crit_sect);
-	return retval;
+	return 0;
 }
 
+int acpiphp_disable_slot(struct acpiphp_slot *slot)
+{
+	int ret;
+
+	pci_lock_rescan_remove();
+	ret = acpiphp_disable_and_eject_slot(slot);
+	pci_unlock_rescan_remove();
+	return ret;
+}
 
 /*
  * slot enabled:  1
@@ -1117,7 +1152,6 @@ u8 acpiphp_get_power_status(struct acpip
 	return (slot->flags & SLOT_ENABLED);
 }
 
-
 /*
  * latch   open:  1
  * latch closed:  0
@@ -1127,7 +1161,6 @@ u8 acpiphp_get_latch_status(struct acpip
 	return !(get_slot_status(slot) & ACPI_STA_DEVICE_UI);
 }
 
-
 /*
  * adapter presence : 1
  *          absence : 0
Index: linux-pm/drivers/pci/hotplug/acpiphp_core.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/acpiphp_core.c
+++ linux-pm/drivers/pci/hotplug/acpiphp_core.c
@@ -156,7 +156,7 @@ static int disable_slot(struct hotplug_s
 	pr_debug("%s - physical_slot = %s\n", __func__, slot_name(slot));
 
 	/* disable the specified slot */
-	return acpiphp_disable_and_eject_slot(slot->acpi_slot);
+	return acpiphp_disable_slot(slot->acpi_slot);
 }
 
 



* [PATCH 4/9] PCMCIA / cardbus: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (2 preceding siblings ...)
  2014-01-10 14:24                                   ` [PATCH 3/9] ACPI / hotplug / PCI: Use global PCI rescan-remove locking Rafael J. Wysocki
@ 2014-01-10 14:25                                   ` Rafael J. Wysocki
  2014-01-10 14:26                                   ` [PATCH 5/9] PCI / hotplug: " Rafael J. Wysocki
                                                     ` (5 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:25 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between the cardbus PCI
device addition and removal and the generic PCI bus rescan and device
removal that can be triggered via sysfs.

To avoid those race conditions make the cardbus code use global
PCI rescan-remove locking.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pcmcia/cardbus.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-pm/drivers/pcmcia/cardbus.c
===================================================================
--- linux-pm.orig/drivers/pcmcia/cardbus.c
+++ linux-pm/drivers/pcmcia/cardbus.c
@@ -70,6 +70,8 @@ int __ref cb_alloc(struct pcmcia_socket
 	struct pci_dev *dev;
 	unsigned int max, pass;
 
+	pci_lock_rescan_remove();
+
 	s->functions = pci_scan_slot(bus, PCI_DEVFN(0, 0));
 	pci_fixup_cardbus(bus);
 
@@ -93,6 +95,7 @@ int __ref cb_alloc(struct pcmcia_socket
 
 	pci_bus_add_devices(bus);
 
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
@@ -115,6 +118,10 @@ void cb_free(struct pcmcia_socket *s)
 	if (!bus)
 		return;
 
+	pci_lock_rescan_remove();
+
 	list_for_each_entry_safe(dev, tmp, &bus->devices, bus_list)
 		pci_stop_and_remove_bus_device(dev);
+
+	pci_unlock_rescan_remove();
 }



* [PATCH 5/9] PCI / hotplug: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (3 preceding siblings ...)
  2014-01-10 14:25                                   ` [PATCH 4/9] PCMCIA / cardbus: " Rafael J. Wysocki
@ 2014-01-10 14:26                                   ` Rafael J. Wysocki
  2014-01-10 14:27                                   ` [PATCH 6/9] platform / x86: " Rafael J. Wysocki
                                                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:26 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between PCI hotplug and the
generic PCI bus rescan and device removal that can be triggered via
sysfs.

To avoid those race conditions make PCI hotplug use global PCI
rescan-remove locking.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/hotplug/cpci_hotplug_pci.c |   14 ++++++++++++--
 drivers/pci/hotplug/cpqphp_pci.c       |    8 +++++++-
 drivers/pci/hotplug/ibmphp_core.c      |   13 +++++++++++--
 drivers/pci/hotplug/pciehp_pci.c       |   17 +++++++++++++----
 drivers/pci/hotplug/rpadlpar_core.c    |   19 ++++++++++++++-----
 drivers/pci/hotplug/rpaphp_core.c      |    4 ++++
 drivers/pci/hotplug/s390_pci_hpc.c     |    4 +++-
 drivers/pci/hotplug/sgi_hotplug.c      |    5 +++++
 drivers/pci/hotplug/shpchp_pci.c       |   18 ++++++++++++++----
 9 files changed, 83 insertions(+), 19 deletions(-)
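
Most of the churn in this patch comes from turning early returns into a single exit, so that the unlock cannot be missed. A minimal userspace sketch of that conversion follows. It assumes a pthread mutex in place of the rescan-remove lock, and struct model_slot and model_configure_device() are hypothetical names that do not correspond to any of the drivers touched here.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Userspace stand-in for the global rescan-remove lock. */
pthread_mutex_t model_rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical hotplug slot. */
struct model_slot {
	int occupied;		/* a device is already configured here */
	int devices_found;	/* what a scan of the slot would find */
};

/*
 * Every failure sets ret and jumps to the one label that unlocks,
 * instead of returning directly from the middle of the function.
 */
int model_configure_device(struct model_slot *slot)
{
	int ret = 0;

	assert(slot != NULL);

	pthread_mutex_lock(&model_rescan_remove_lock);

	if (slot->occupied) {
		ret = -EINVAL;	/* device already exists, cannot hot-add */
		goto out;
	}

	if (slot->devices_found == 0) {
		ret = -ENODEV;	/* no new device found */
		goto out;
	}

	slot->occupied = 1;

out:
	pthread_mutex_unlock(&model_rescan_remove_lock);
	return ret;
}
```

The label costs one extra line per error path, but it keeps lock and unlock paired by construction, which is easier to review across nine drivers than checking each return by hand.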

Index: linux-pm/drivers/pci/hotplug/rpadlpar_core.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/rpadlpar_core.c
+++ linux-pm/drivers/pci/hotplug/rpadlpar_core.c
@@ -354,10 +354,15 @@ int dlpar_remove_pci_slot(char *drc_name
 {
 	struct pci_bus *bus;
 	struct slot *slot;
+	int ret = 0;
+
+	pci_lock_rescan_remove();
 
 	bus = pcibios_find_pci_bus(dn);
-	if (!bus)
-		return -EINVAL;
+	if (!bus) {
+		ret = -EINVAL;
+		goto out;
+	}
 
 	pr_debug("PCI: Removing PCI slot below EADS bridge %s\n",
 		 bus->self ? pci_name(bus->self) : "<!PHB!>");
@@ -371,7 +376,8 @@ int dlpar_remove_pci_slot(char *drc_name
 			printk(KERN_ERR
 				"%s: unable to remove hotplug slot %s\n",
 				__func__, drc_name);
-			return -EIO;
+			ret = -EIO;
+			goto out;
 		}
 	}
 
@@ -382,7 +388,8 @@ int dlpar_remove_pci_slot(char *drc_name
 	if (pcibios_unmap_io_space(bus)) {
 		printk(KERN_ERR "%s: failed to unmap bus range\n",
 			__func__);
-		return -ERANGE;
+		ret = -ERANGE;
+		goto out;
 	}
 
 	/* Remove the EADS bridge device itself */
@@ -390,7 +397,9 @@ int dlpar_remove_pci_slot(char *drc_name
 	pr_debug("PCI: Now removing bridge device %s\n", pci_name(bus->self));
 	pci_stop_and_remove_bus_device(bus->self);
 
-	return 0;
+ out:
+	pci_unlock_rescan_remove();
+	return ret;
 }
 
 /**
Index: linux-pm/drivers/pci/hotplug/cpci_hotplug_pci.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/cpci_hotplug_pci.c
+++ linux-pm/drivers/pci/hotplug/cpci_hotplug_pci.c
@@ -254,9 +254,12 @@ int __ref cpci_configure_slot(struct slo
 {
 	struct pci_dev *dev;
 	struct pci_bus *parent;
+	int ret = 0;
 
 	dbg("%s - enter", __func__);
 
+	pci_lock_rescan_remove();
+
 	if (slot->dev == NULL) {
 		dbg("pci_dev null, finding %02x:%02x:%x",
 		    slot->bus->number, PCI_SLOT(slot->devfn), PCI_FUNC(slot->devfn));
@@ -277,7 +280,8 @@ int __ref cpci_configure_slot(struct slo
 		slot->dev = pci_get_slot(slot->bus, slot->devfn);
 		if (slot->dev == NULL) {
 			err("Could not find PCI device for slot %02x", slot->number);
-			return -ENODEV;
+			ret = -ENODEV;
+			goto out;
 		}
 	}
 	parent = slot->dev->bus;
@@ -294,8 +298,10 @@ int __ref cpci_configure_slot(struct slo
 
 	pci_bus_add_devices(parent);
 
+ out:
+	pci_unlock_rescan_remove();
 	dbg("%s - exit", __func__);
-	return 0;
+	return ret;
 }
 
 int cpci_unconfigure_slot(struct slot* slot)
@@ -308,6 +314,8 @@ int cpci_unconfigure_slot(struct slot* s
 		return -ENODEV;
 	}
 
+	pci_lock_rescan_remove();
+
 	list_for_each_entry_safe(dev, temp, &slot->bus->devices, bus_list) {
 		if (PCI_SLOT(dev->devfn) != PCI_SLOT(slot->devfn))
 			continue;
@@ -318,6 +326,8 @@ int cpci_unconfigure_slot(struct slot* s
 	pci_dev_put(slot->dev);
 	slot->dev = NULL;
 
+	pci_unlock_rescan_remove();
+
 	dbg("%s - exit", __func__);
 	return 0;
 }
Index: linux-pm/drivers/pci/hotplug/pciehp_pci.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/pciehp_pci.c
+++ linux-pm/drivers/pci/hotplug/pciehp_pci.c
@@ -39,22 +39,26 @@ int pciehp_configure_device(struct slot
 	struct pci_dev *dev;
 	struct pci_dev *bridge = p_slot->ctrl->pcie->port;
 	struct pci_bus *parent = bridge->subordinate;
-	int num;
+	int num, ret = 0;
 	struct controller *ctrl = p_slot->ctrl;
 
+	pci_lock_rescan_remove();
+
 	dev = pci_get_slot(parent, PCI_DEVFN(0, 0));
 	if (dev) {
 		ctrl_err(ctrl, "Device %s already exists "
 			 "at %04x:%02x:00, cannot hot-add\n", pci_name(dev),
 			 pci_domain_nr(parent), parent->number);
 		pci_dev_put(dev);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	num = pci_scan_slot(parent, PCI_DEVFN(0, 0));
 	if (num == 0) {
 		ctrl_err(ctrl, "No new device found\n");
-		return -ENODEV;
+		ret = -ENODEV;
+		goto out;
 	}
 
 	list_for_each_entry(dev, &parent->devices, bus_list)
@@ -73,7 +77,9 @@ int pciehp_configure_device(struct slot
 
 	pci_bus_add_devices(parent);
 
-	return 0;
+ out:
+	pci_unlock_rescan_remove();
+	return ret;
 }
 
 int pciehp_unconfigure_device(struct slot *p_slot)
@@ -92,6 +98,8 @@ int pciehp_unconfigure_device(struct slo
 	if (ret)
 		presence = 0;
 
+	pci_lock_rescan_remove();
+
 	/*
 	 * Stopping an SR-IOV PF device removes all the associated VFs,
 	 * which will update the bus->devices list and confuse the
@@ -126,5 +134,6 @@ int pciehp_unconfigure_device(struct slo
 		pci_dev_put(dev);
 	}
 
+	pci_unlock_rescan_remove();
 	return rc;
 }
Index: linux-pm/drivers/pci/hotplug/cpqphp_pci.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/cpqphp_pci.c
+++ linux-pm/drivers/pci/hotplug/cpqphp_pci.c
@@ -86,6 +86,8 @@ int cpqhp_configure_device (struct contr
 	struct pci_bus *child;
 	int num;
 
+	pci_lock_rescan_remove();
+
 	if (func->pci_dev == NULL)
 		func->pci_dev = pci_get_bus_and_slot(func->bus,PCI_DEVFN(func->device, func->function));
 
@@ -100,7 +102,7 @@ int cpqhp_configure_device (struct contr
 		func->pci_dev = pci_get_bus_and_slot(func->bus, PCI_DEVFN(func->device, func->function));
 		if (func->pci_dev == NULL) {
 			dbg("ERROR: pci_dev still null\n");
-			return 0;
+			goto out;
 		}
 	}
 
@@ -113,6 +115,8 @@ int cpqhp_configure_device (struct contr
 
 	pci_dev_put(func->pci_dev);
 
+ out:
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
@@ -123,6 +127,7 @@ int cpqhp_unconfigure_device(struct pci_
 
 	dbg("%s: bus/dev/func = %x/%x/%x\n", __func__, func->bus, func->device, func->function);
 
+	pci_lock_rescan_remove();
 	for (j=0; j<8 ; j++) {
 		struct pci_dev* temp = pci_get_bus_and_slot(func->bus, PCI_DEVFN(func->device, j));
 		if (temp) {
@@ -130,6 +135,7 @@ int cpqhp_unconfigure_device(struct pci_
 			pci_stop_and_remove_bus_device(temp);
 		}
 	}
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
Index: linux-pm/drivers/pci/hotplug/ibmphp_core.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/ibmphp_core.c
+++ linux-pm/drivers/pci/hotplug/ibmphp_core.c
@@ -718,6 +718,8 @@ static void ibm_unconfigure_device(struc
 					func->device, func->function);
 	debug("func->device << 3 | 0x0  = %x\n", func->device << 3 | 0x0);
 
+	pci_lock_rescan_remove();
+
 	for (j = 0; j < 0x08; j++) {
 		temp = pci_get_bus_and_slot(func->busno, (func->device << 3) | j);
 		if (temp) {
@@ -725,7 +727,10 @@ static void ibm_unconfigure_device(struc
 			pci_dev_put(temp);
 		}
 	}
+
 	pci_dev_put(func->dev);
+
+	pci_unlock_rescan_remove();
 }
 
 /*
@@ -780,6 +785,8 @@ static int ibm_configure_device(struct p
 	int flag = 0;	/* this is to make sure we don't double scan the bus,
 					for bridged devices primarily */
 
+	pci_lock_rescan_remove();
+
 	if (!(bus_structure_fixup(func->busno)))
 		flag = 1;
 	if (func->dev == NULL)
@@ -789,7 +796,7 @@ static int ibm_configure_device(struct p
 	if (func->dev == NULL) {
 		struct pci_bus *bus = pci_find_bus(0, func->busno);
 		if (!bus)
-			return 0;
+			goto out;
 
 		num = pci_scan_slot(bus,
 				PCI_DEVFN(func->device, func->function));
@@ -800,7 +807,7 @@ static int ibm_configure_device(struct p
 				PCI_DEVFN(func->device, func->function));
 		if (func->dev == NULL) {
 			err("ERROR... : pci_dev still NULL\n");
-			return 0;
+			goto out;
 		}
 	}
 	if (!(flag) && (func->dev->hdr_type == PCI_HEADER_TYPE_BRIDGE)) {
@@ -810,6 +817,8 @@ static int ibm_configure_device(struct p
 			pci_bus_add_devices(child);
 	}
 
+ out:
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
Index: linux-pm/drivers/pci/hotplug/s390_pci_hpc.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/s390_pci_hpc.c
+++ linux-pm/drivers/pci/hotplug/s390_pci_hpc.c
@@ -80,7 +80,9 @@ static int enable_slot(struct hotplug_sl
 		goto out_deconfigure;
 
 	pci_scan_slot(slot->zdev->bus, ZPCI_DEVFN);
+	pci_lock_rescan_remove();
 	pci_bus_add_devices(slot->zdev->bus);
+	pci_unlock_rescan_remove();
 
 	return rc;
 
@@ -98,7 +100,7 @@ static int disable_slot(struct hotplug_s
 		return -EIO;
 
 	if (slot->zdev->pdev)
-		pci_stop_and_remove_bus_device(slot->zdev->pdev);
+		pci_stop_and_remove_bus_device_locked(slot->zdev->pdev);
 
 	rc = zpci_disable_device(slot->zdev);
 	if (rc)
Index: linux-pm/drivers/pci/hotplug/shpchp_pci.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/shpchp_pci.c
+++ linux-pm/drivers/pci/hotplug/shpchp_pci.c
@@ -40,7 +40,9 @@ int __ref shpchp_configure_device(struct
 	struct controller *ctrl = p_slot->ctrl;
 	struct pci_dev *bridge = ctrl->pci_dev;
 	struct pci_bus *parent = bridge->subordinate;
-	int num;
+	int num, ret = 0;
+
+	pci_lock_rescan_remove();
 
 	dev = pci_get_slot(parent, PCI_DEVFN(p_slot->device, 0));
 	if (dev) {
@@ -48,13 +50,15 @@ int __ref shpchp_configure_device(struct
 			 "at %04x:%02x:%02x, cannot hot-add\n", pci_name(dev),
 			 pci_domain_nr(parent), p_slot->bus, p_slot->device);
 		pci_dev_put(dev);
-		return -EINVAL;
+		ret = -EINVAL;
+		goto out;
 	}
 
 	num = pci_scan_slot(parent, PCI_DEVFN(p_slot->device, 0));
 	if (num == 0) {
 		ctrl_err(ctrl, "No new device found\n");
-		return -ENODEV;
+		ret = -ENODEV;
+		goto out;
 	}
 
 	list_for_each_entry(dev, &parent->devices, bus_list) {
@@ -75,7 +79,9 @@ int __ref shpchp_configure_device(struct
 
 	pci_bus_add_devices(parent);
 
-	return 0;
+ out:
+	pci_unlock_rescan_remove();
+	return ret;
 }
 
 int shpchp_unconfigure_device(struct slot *p_slot)
@@ -89,6 +95,8 @@ int shpchp_unconfigure_device(struct slo
 	ctrl_dbg(ctrl, "%s: domain:bus:dev = %04x:%02x:%02x\n",
 		 __func__, pci_domain_nr(parent), p_slot->bus, p_slot->device);
 
+	pci_lock_rescan_remove();
+
 	list_for_each_entry_safe(dev, temp, &parent->devices, bus_list) {
 		if (PCI_SLOT(dev->devfn) != p_slot->device)
 			continue;
@@ -108,6 +116,8 @@ int shpchp_unconfigure_device(struct slo
 		pci_stop_and_remove_bus_device(dev);
 		pci_dev_put(dev);
 	}
+
+	pci_unlock_rescan_remove();
 	return rc;
 }
 
Index: linux-pm/drivers/pci/hotplug/sgi_hotplug.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/sgi_hotplug.c
+++ linux-pm/drivers/pci/hotplug/sgi_hotplug.c
@@ -459,12 +459,15 @@ static int enable_slot(struct hotplug_sl
 		acpi_scan_lock_release();
 	}
 
+	pci_lock_rescan_remove();
+
 	/* Call the driver for the new device */
 	pci_bus_add_devices(slot->pci_bus);
 	/* Call the drivers for the new devices subordinate to PPB */
 	if (new_ppb)
 		pci_bus_add_devices(new_bus);
 
+	pci_unlock_rescan_remove();
 	mutex_unlock(&sn_hotplug_mutex);
 
 	if (rc == 0)
@@ -540,6 +543,7 @@ static int disable_slot(struct hotplug_s
 		acpi_scan_lock_release();
 	}
 
+	pci_lock_rescan_remove();
 	/* Free the SN resources assigned to the Linux device.*/
 	list_for_each_entry_safe(dev, temp, &slot->pci_bus->devices, bus_list) {
 		if (PCI_SLOT(dev->devfn) != slot->device_num + 1)
@@ -550,6 +554,7 @@ static int disable_slot(struct hotplug_s
 		pci_stop_and_remove_bus_device(dev);
 		pci_dev_put(dev);
 	}
+	pci_unlock_rescan_remove();
 
 	/* Remove the SSDT for the slot from the ACPI namespace */
 	if (SN_ACPI_BASE_SUPPORT() && ssdt_id) {
Index: linux-pm/drivers/pci/hotplug/rpaphp_core.c
===================================================================
--- linux-pm.orig/drivers/pci/hotplug/rpaphp_core.c
+++ linux-pm/drivers/pci/hotplug/rpaphp_core.c
@@ -398,7 +398,9 @@ static int enable_slot(struct hotplug_sl
 		return retval;
 
 	if (state == PRESENT) {
+		pci_lock_rescan_remove();
 		pcibios_add_pci_devices(slot->bus);
+		pci_unlock_rescan_remove();
 		slot->state = CONFIGURED;
 	} else if (state == EMPTY) {
 		slot->state = EMPTY;
@@ -418,7 +420,9 @@ static int disable_slot(struct hotplug_s
 	if (slot->state == NOT_CONFIGURED)
 		return -EINVAL;
 
+	pci_lock_rescan_remove();
 	pcibios_remove_pci_devices(slot->bus);
+	pci_unlock_rescan_remove();
 	vm_unmap_aliases();
 
 	slot->state = NOT_CONFIGURED;



* [PATCH 6/9] platform / x86: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (4 preceding siblings ...)
  2014-01-10 14:26                                   ` [PATCH 5/9] PCI / hotplug: " Rafael J. Wysocki
@ 2014-01-10 14:27                                   ` Rafael J. Wysocki
  2014-01-10 14:27                                   ` [PATCH 7/9] MPT / PCI: Use pci_stop_and_remove_bus_device_locked() Rafael J. Wysocki
                                                     ` (3 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between the rfkill hotplug in
the asus-wmi and eeepc-laptop drivers and the generic PCI bus rescan
and device removal that can be triggered via sysfs.

To avoid those race conditions make asus-wmi and eeepc-laptop use
global PCI rescan-remove locking around the rfkill hotplug.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/platform/x86/asus-wmi.c     |    2 ++
 drivers/platform/x86/eeepc-laptop.c |    2 ++
 2 files changed, 4 insertions(+)
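
In both drivers the global lock nests inside the driver's own hotplug_lock and is released first. A minimal userspace sketch of that nesting follows. It assumes two pthread mutexes as stand-ins, and struct model_wlan and model_rfkill_hotplug() are hypothetical names for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Stand-ins: a driver-private lock and the global rescan-remove lock. */
pthread_mutex_t model_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t model_rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical wireless adapter that appears or vanishes with rfkill. */
struct model_wlan {
	int on_bus;
};

void model_rfkill_hotplug(struct model_wlan *wlan, int blocked)
{
	assert(wlan != NULL);

	pthread_mutex_lock(&model_hotplug_lock);	/* outer: driver state */
	pthread_mutex_lock(&model_rescan_remove_lock);	/* inner: PCI topology */

	if (blocked && wlan->on_bus)
		wlan->on_bus = 0;	/* models stop-and-remove of the device */
	else if (!blocked && !wlan->on_bus)
		wlan->on_bus = 1;	/* models scan-and-add of the device */

	/* Release in reverse order of acquisition. */
	pthread_mutex_unlock(&model_rescan_remove_lock);
	pthread_mutex_unlock(&model_hotplug_lock);
}
```

Keeping the same acquisition order wherever both locks are held is what rules out an ABBA deadlock between two such paths.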

Index: linux-pm/drivers/platform/x86/asus-wmi.c
===================================================================
--- linux-pm.orig/drivers/platform/x86/asus-wmi.c
+++ linux-pm/drivers/platform/x86/asus-wmi.c
@@ -605,6 +605,7 @@ static void asus_rfkill_hotplug(struct a
 	mutex_unlock(&asus->wmi_lock);
 
 	mutex_lock(&asus->hotplug_lock);
+	pci_lock_rescan_remove();
 
 	if (asus->wlan.rfkill)
 		rfkill_set_sw_state(asus->wlan.rfkill, blocked);
@@ -655,6 +656,7 @@ static void asus_rfkill_hotplug(struct a
 	}
 
 out_unlock:
+	pci_unlock_rescan_remove();
 	mutex_unlock(&asus->hotplug_lock);
 }
 
Index: linux-pm/drivers/platform/x86/eeepc-laptop.c
===================================================================
--- linux-pm.orig/drivers/platform/x86/eeepc-laptop.c
+++ linux-pm/drivers/platform/x86/eeepc-laptop.c
@@ -591,6 +591,7 @@ static void eeepc_rfkill_hotplug(struct
 		rfkill_set_sw_state(eeepc->wlan_rfkill, blocked);
 
 	mutex_lock(&eeepc->hotplug_lock);
+	pci_lock_rescan_remove();
 
 	if (eeepc->hotplug_slot) {
 		port = acpi_get_pci_dev(handle);
@@ -648,6 +649,7 @@ out_put_dev:
 	}
 
 out_unlock:
+	pci_unlock_rescan_remove();
 	mutex_unlock(&eeepc->hotplug_lock);
 }
 



* [PATCH 7/9] MPT / PCI: Use pci_stop_and_remove_bus_device_locked()
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (5 preceding siblings ...)
  2014-01-10 14:27                                   ` [PATCH 6/9] platform / x86: " Rafael J. Wysocki
@ 2014-01-10 14:27                                   ` Rafael J. Wysocki
  2014-01-10 14:28                                   ` [PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking Rafael J. Wysocki
                                                     ` (2 subsequent siblings)
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:27 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Race conditions are theoretically possible between the MPT PCI
device removal and the generic PCI bus rescan and device removal
that can be triggered via sysfs.

To avoid those race conditions make the MPT PCI code use
pci_stop_and_remove_bus_device_locked().

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/message/fusion/mptbase.c    |    2 +-
 drivers/scsi/mpt2sas/mpt2sas_base.c |    2 +-
 drivers/scsi/mpt3sas/mpt3sas_base.c |    2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)
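
The _locked variant exists for callers that remove a single device and do not otherwise hold the global lock. A minimal userspace sketch of the two entry points follows. It assumes a pthread mutex plus a held flag as a stand-in for the rescan-remove lock, and struct model_dev and the model_* functions are hypothetical names for illustration.

```c
#include <assert.h>
#include <pthread.h>

/* Userspace stand-in for the global rescan-remove lock. */
pthread_mutex_t model_rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;
int model_lock_held;	/* only written while the mutex is held */

void model_lock_rescan_remove(void)
{
	pthread_mutex_lock(&model_rescan_remove_lock);
	model_lock_held = 1;
}

void model_unlock_rescan_remove(void)
{
	model_lock_held = 0;
	pthread_mutex_unlock(&model_rescan_remove_lock);
}

/* Hypothetical device. */
struct model_dev {
	int present;
};

/* Plain variant: the caller is expected to hold the lock already. */
void model_stop_and_remove(struct model_dev *dev)
{
	assert(model_lock_held);
	dev->present = 0;
}

/* _locked variant: for callers that hold nothing, such as a dead-IOC thread. */
void model_stop_and_remove_locked(struct model_dev *dev)
{
	model_lock_rescan_remove();
	model_stop_and_remove(dev);
	model_unlock_rescan_remove();
}
```

Paths that remove several devices in a loop can take the lock once around the loop and call the plain variant, while one-shot callers like the MPT ones only need the wrapper.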

Index: linux-pm/drivers/scsi/mpt3sas/mpt3sas_base.c
===================================================================
--- linux-pm.orig/drivers/scsi/mpt3sas/mpt3sas_base.c
+++ linux-pm/drivers/scsi/mpt3sas/mpt3sas_base.c
@@ -131,7 +131,7 @@ static int mpt3sas_remove_dead_ioc_func(
 	pdev = ioc->pdev;
 	if ((pdev == NULL))
 		return -1;
-	pci_stop_and_remove_bus_device(pdev);
+	pci_stop_and_remove_bus_device_locked(pdev);
 	return 0;
 }
 
Index: linux-pm/drivers/scsi/mpt2sas/mpt2sas_base.c
===================================================================
--- linux-pm.orig/drivers/scsi/mpt2sas/mpt2sas_base.c
+++ linux-pm/drivers/scsi/mpt2sas/mpt2sas_base.c
@@ -128,7 +128,7 @@ static int mpt2sas_remove_dead_ioc_func(
 		pdev = ioc->pdev;
 		if ((pdev == NULL))
 			return -1;
-		pci_stop_and_remove_bus_device(pdev);
+		pci_stop_and_remove_bus_device_locked(pdev);
 		return 0;
 }
 
Index: linux-pm/drivers/message/fusion/mptbase.c
===================================================================
--- linux-pm.orig/drivers/message/fusion/mptbase.c
+++ linux-pm/drivers/message/fusion/mptbase.c
@@ -346,7 +346,7 @@ static int mpt_remove_dead_ioc_func(void
 	if ((pdev == NULL))
 		return -1;
 
-	pci_stop_and_remove_bus_device(pdev);
+	pci_stop_and_remove_bus_device_locked(pdev);
 	return 0;
 }
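
For readers without patch 1/9 at hand, the _locked variant used above is
expected to be a thin wrapper that takes the global rescan-remove lock
around the unlocked helper.  The following is only a userspace sketch of
that shape, not kernel code: a pthread mutex stands in for the kernel
mutex, and struct fake_dev, stop_and_remove() and stop_and_remove_locked()
are hypothetical names made up for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct pci_dev; not the kernel structure. */
struct fake_dev {
	bool removed;
};

/* Models the global rescan-remove lock introduced earlier in the series. */
static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held;	/* bookkeeping so the sketch can check its own rule */

static void lock_rescan_remove(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	lock_held = true;
}

static void unlock_rescan_remove(void)
{
	lock_held = false;
	pthread_mutex_unlock(&rescan_remove_lock);
}

/* Unlocked helper: the caller must already hold the lock. */
static void stop_and_remove(struct fake_dev *dev)
{
	assert(lock_held);
	dev->removed = true;
}

/* Locked variant: what a driver's "dead IOC" path would call. */
void stop_and_remove_locked(struct fake_dev *dev)
{
	lock_rescan_remove();
	stop_and_remove(dev);
	unlock_rescan_remove();
}
```

With a wrapper like this, each of the three MPT call sites changes by one
line and none of them has to open-code the lock/unlock pair.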
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (6 preceding siblings ...)
  2014-01-10 14:27                                   ` [PATCH 7/9] MPT / PCI: Use pci_stop_and_remove_bus_device_locked() Rafael J. Wysocki
@ 2014-01-10 14:28                                   ` Rafael J. Wysocki
  2014-01-15 13:36                                     ` [Update][PATCH " Rafael J. Wysocki
  2014-01-10 14:29                                   ` [PATCH 9/9] Xen / PCI: " Rafael J. Wysocki
  2014-01-15 18:02                                   ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Bjorn Helgaas
  9 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:28 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Race conditions are theoretically possible between the PCI device
addition and removal in the PPC64 PCI error recovery driver and
the generic PCI bus rescan and device removal that can be triggered
via sysfs.

To avoid those race conditions, make the PPC64 PCI error recovery
driver use global PCI rescan-remove locking around PCI device addition
and removal.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 arch/powerpc/kernel/eeh_driver.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-pm/arch/powerpc/kernel/eeh_driver.c
===================================================================
--- linux-pm.orig/arch/powerpc/kernel/eeh_driver.c
+++ linux-pm/arch/powerpc/kernel/eeh_driver.c
@@ -369,7 +369,9 @@ static void *eeh_rmv_device(void *data,
 	edev->mode |= EEH_DEV_DISCONNECTED;
 	(*removed)++;
 
+	pci_lock_rescan_remove();
 	pci_stop_and_remove_bus_device(dev);
+	pci_unlock_rescan_remove();
 
 	return NULL;
 }
@@ -416,10 +418,13 @@ static int eeh_reset_device(struct eeh_p
 	 * into pcibios_add_pci_devices().
 	 */
 	eeh_pe_state_mark(pe, EEH_PE_KEEP);
-	if (bus)
+	if (bus) {
+		pci_lock_rescan_remove();
 		pcibios_remove_pci_devices(bus);
-	else if (frozen_bus)
+		pci_unlock_rescan_remove();
+	} else if (frozen_bus) {
 		eeh_pe_dev_traverse(pe, eeh_rmv_device, &removed);
+	}
 
 	/* Reset the pci controller. (Asserts RST#; resets config space).
 	 * Reconfigure bridges and devices. Don't try to bring the system
@@ -429,6 +434,8 @@ static int eeh_reset_device(struct eeh_p
 	if (rc)
 		return rc;
 
+	pci_lock_rescan_remove();
+
 	/* Restore PE */
 	eeh_ops->configure_bridge(pe);
 	eeh_pe_restore_bars(pe);
@@ -462,6 +469,7 @@ static int eeh_reset_device(struct eeh_p
 	pe->tstamp = tstamp;
 	pe->freeze_count = cnt;
 
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
@@ -618,8 +626,11 @@ perm_error:
 	eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
 
 	/* Shut down the device drivers for good. */
-	if (frozen_bus)
+	if (frozen_bus) {
+		pci_lock_rescan_remove();
 		pcibios_remove_pci_devices(frozen_bus);
+		pci_unlock_rescan_remove();
+	}
 }
 
 static void eeh_handle_special_event(void)
@@ -692,6 +703,7 @@ static void eeh_handle_special_event(voi
 	if (rc == 2 || rc == 1)
 		eeh_handle_normal_event(pe);
 	else {
+		lock_pci_remove_rescan();
 		list_for_each_entry_safe(hose, tmp,
 			&hose_list, list_node) {
 			phb_pe = eeh_phb_pe_get(hose);
@@ -703,6 +715,7 @@ static void eeh_handle_special_event(voi
 			eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
 			pcibios_remove_pci_devices(bus);
 		}
+		unlock_pci_remove_rescan();
 	}
 }
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH 9/9] Xen / PCI: Use global PCI rescan-remove locking
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (7 preceding siblings ...)
  2014-01-10 14:28                                   ` [PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking Rafael J. Wysocki
@ 2014-01-10 14:29                                   ` Rafael J. Wysocki
  2014-01-15 18:02                                   ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Bjorn Helgaas
  9 siblings, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-10 14:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>

Multiple race conditions are possible between the Xen pcifront
device addition and removal and the generic PCI device addition
and removal that can be triggered via sysfs.

To avoid those race conditions make the Xen pcifront code use global
PCI rescan-remove locking.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/xen-pcifront.c |    8 ++++++++
 1 file changed, 8 insertions(+)

Index: linux-pm/drivers/pci/xen-pcifront.c
===================================================================
--- linux-pm.orig/drivers/pci/xen-pcifront.c
+++ linux-pm/drivers/pci/xen-pcifront.c
@@ -471,12 +471,15 @@ static int pcifront_scan_root(struct pci
 	}
 	pcifront_init_sd(sd, domain, bus, pdev);
 
+	pci_lock_rescan_remove();
+
 	b = pci_scan_bus_parented(&pdev->xdev->dev, bus,
 				  &pcifront_bus_ops, sd);
 	if (!b) {
 		dev_err(&pdev->xdev->dev,
 			"Error creating PCI Frontend Bus!\n");
 		err = -ENOMEM;
+		pci_unlock_rescan_remove();
 		goto err_out;
 	}
 
@@ -494,6 +497,7 @@ static int pcifront_scan_root(struct pci
 	/* Create SysFS and notify udev of the devices. Aka: "going live" */
 	pci_bus_add_devices(b);
 
+	pci_unlock_rescan_remove();
 	return err;
 
 err_out:
@@ -556,6 +560,7 @@ static void pcifront_free_roots(struct p
 
 	dev_dbg(&pdev->xdev->dev, "cleaning up root buses\n");
 
+	pci_lock_rescan_remove();
 	list_for_each_entry_safe(bus_entry, t, &pdev->root_buses, list) {
 		list_del(&bus_entry->list);
 
@@ -568,6 +573,7 @@ static void pcifront_free_roots(struct p
 
 		kfree(bus_entry);
 	}
+	pci_unlock_rescan_remove();
 }
 
 static pci_ers_result_t pcifront_common_process(int cmd,
@@ -1043,8 +1049,10 @@ static int pcifront_detach_devices(struc
 				domain, bus, slot, func);
 			continue;
 		}
+		pci_lock_rescan_remove();
 		pci_stop_and_remove_bus_device(pci_dev);
 		pci_dev_put(pci_dev);
+		pci_unlock_rescan_remove();
 
 		dev_dbg(&pdev->xdev->dev,
 			"PCI device %04x:%02x:%02x.%d removed.\n",

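
The pcifront_scan_root() hunk above has two exits once the lock is taken,
so the lock has to be dropped on the error path before the jump to the
error label as well as on the success path.  A minimal userspace sketch of
that pattern follows.  It assumes a pthread mutex in place of the kernel
lock; struct fake_bus, scan_bus() and scan_root() are hypothetical and
only model success and failure.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

static pthread_mutex_t rescan_remove_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held;	/* lets callers observe whether the lock leaked */

static void lock_rescan_remove(void)
{
	pthread_mutex_lock(&rescan_remove_lock);
	lock_held = true;
}

static void unlock_rescan_remove(void)
{
	lock_held = false;
	pthread_mutex_unlock(&rescan_remove_lock);
}

/* Hypothetical bus object; it only records whether a scan succeeded. */
struct fake_bus {
	int nr_devices;
};

/* Stand-in for the bus scan; must run under the lock. */
static struct fake_bus *scan_bus(struct fake_bus *storage, bool fail)
{
	assert(lock_held);
	if (fail)
		return NULL;
	storage->nr_devices = 1;
	return storage;
}

int scan_root(struct fake_bus *storage, bool fail)
{
	struct fake_bus *b;
	int err = 0;

	lock_rescan_remove();

	b = scan_bus(storage, fail);
	if (!b) {
		err = -ENOMEM;
		/* Drop the lock before leaving through the error label. */
		unlock_rescan_remove();
		goto err_out;
	}

	/* "Going live" would happen here, still under the lock. */
	unlock_rescan_remove();
	return err;

err_out:
	/* Cleanup that does not need the lock. */
	storage->nr_devices = 0;
	return err;
}

bool rescan_remove_is_locked(void)
{
	return lock_held;
}
```

An alternative would be a second label that unlocks and then falls through
to the cleanup; unlocking inline keeps the existing err_out label usable by
the earlier failure paths that never took the lock.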

^ permalink raw reply	[flat|nested] 69+ messages in thread

* [PATCH] PCI / remove: Check parent kobject in pci_destroy_dev() (was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once)
  2013-12-07  1:27                               ` Rafael J. Wysocki
  2013-12-08  3:31                                 ` Yinghai Lu
@ 2014-01-13  1:03                                 ` Rafael J. Wysocki
  1 sibling, 0 replies; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-13  1:03 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Bjorn Helgaas, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe

On Saturday, December 07, 2013 02:27:51 AM Rafael J. Wysocki wrote:
> On Thursday, December 05, 2013 10:52:36 PM Yinghai Lu wrote:
> > On Mon, Dec 2, 2013 at 6:49 AM, Rafael J. Wysocki <rjw@rjwysocki.net> wrote:
> > >
> > > Scenario 5: pci_stop_and_remove_bus_device() is run concurrently
> > >   for a device and its parent bridge via remove_callback().
> > >
> > >   In that case both code paths attempt to acquire
> > >   pci_remove_rescan_mutex.  If the child device removal acquires
> > >   it first, there will be no problems.  However, if the parent
> > >   bridge removal acquires it first, it will eventually execute
> > >   pci_destroy_dev() for the child device, but that device will
> > >   not be freed yet due to the reference held by the concurrent
> > >   child removal.  Consequently, both pci_stop_bus_device() and
> > >   pci_remove_bus_device() will be executed for that device
> > >   unnecessarily and pci_destroy_dev() will see a corrupted list
> > >   head in that object.  Moreover, an excess put_device() will
> > >   be executed for that device in that case which may lead to a
> > >   use-after-free in the final kobject_put() done by
> > >   sysfs_schedule_callback_work().
> > >
> > > Index: linux-pm/include/linux/pci.h
> > > ===================================================================
> > > --- linux-pm.orig/include/linux/pci.h
> > > +++ linux-pm/include/linux/pci.h
> > > @@ -321,6 +321,7 @@ struct pci_dev {
> > >         unsigned int    multifunction:1;/* Part of multi-function device */
> > >         /* keep track of device state */
> > >         unsigned int    is_added:1;
> > > +       unsigned int    is_gone:1;
> > >         unsigned int    is_busmaster:1; /* device is busmaster */
> > >         unsigned int    no_msi:1;       /* device may not use msi */
> > >         unsigned int    block_cfg_access:1;     /* config space access is blocked */
> > > Index: linux-pm/drivers/pci/remove.c
> > > ===================================================================
> > > --- linux-pm.orig/drivers/pci/remove.c
> > > +++ linux-pm/drivers/pci/remove.c
> > > @@ -34,6 +34,7 @@ static void pci_stop_dev(struct pci_dev
> > >
> > >  static void pci_destroy_dev(struct pci_dev *dev)
> > >  {
> > > +       dev->is_gone = 1;
> > >         device_del(&dev->dev);
> > >
> > >         down_write(&pci_bus_sem);
> > > @@ -109,8 +110,10 @@ static void pci_remove_bus_device(struct
> > >   */
> > >  void pci_stop_and_remove_bus_device(struct pci_dev *dev)
> > >  {
> > > -       pci_stop_bus_device(dev);
> > > -       pci_remove_bus_device(dev);
> > > +       if (!dev->is_gone) {
> > > +               pci_stop_bus_device(dev);
> > > +               pci_remove_bus_device(dev);
> > > +       }
> > >  }
> > >  EXPORT_SYMBOL(pci_stop_and_remove_bus_device);
> > >
> > 
> > Yes, the above change should address the sysfs double-remove problem.
> 
> I've just realized that we don't need a new flag for that, though.
> 
> It looks like we only need to check dev->dev.kobj.parent and return if that is
> NULL, because that means pci_destroy_dev() has run for that device already
> (I'm wondering why device_del() doesn't clear dev->parent, BTW, it looks like
> it should do that?).
> 
> Of course, that still is going to be racy if we don't hold
> pci_remove_rescan_mutex around pci_stop_and_remove_bus_device() in every code
> path using it (or use another similar synchronization mechanism).

Before I forget about this: the patch below applies on top of the series I
sent out on Friday.

Thanks,
Rafael

---
From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: PCI / remove: Check parent kobject in pci_destroy_dev()

If pci_stop_and_remove_bus_device() is run concurrently for a device
and its parent bridge via remove_callback(), both code paths attempt
to acquire pci_rescan_remove_lock.  If the child device removal
acquires it first, there will be no problems.  However, if the parent
bridge removal acquires it first, it will eventually execute
pci_destroy_dev() for the child device, but that device object will
not be freed yet due to the reference held by the concurrent child
removal.  Consequently, both pci_stop_bus_device() and
pci_remove_bus_device() will be executed for that device unnecessarily
and pci_destroy_dev() will see a corrupted list head in that object.
Moreover, an excess put_device() will be executed for that device in
that case which may lead to a use-after-free in the final
kobject_put() done by sysfs_schedule_callback_work().

To avoid that problem, make pci_destroy_dev() check if the device's
parent kobject is NULL, which only happens after device_del() has
already run for it.  Make pci_destroy_dev() return immediately
without doing anything in that case.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---
 drivers/pci/remove.c |    3 +++
 1 file changed, 3 insertions(+)

Index: linux-pm/drivers/pci/remove.c
===================================================================
--- linux-pm.orig/drivers/pci/remove.c
+++ linux-pm/drivers/pci/remove.c
@@ -34,6 +34,9 @@ static void pci_stop_dev(struct pci_dev
 
 static void pci_destroy_dev(struct pci_dev *dev)
 {
+	if (!dev->dev.kobj.parent)
+		return;
+
 	device_del(&dev->dev);
 
 	down_write(&pci_bus_sem);
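
The check added above makes the teardown idempotent: the first call
clears the parent link as a side effect of device_del(), and any later
call sees NULL and backs out.  A small userspace model of that idea is
sketched below.  struct fake_kobj, struct fake_dev, fake_device_del() and
destroy_dev() are hypothetical stand-ins, and the only property of
device_del() modeled is the one the changelog relies on, namely that the
parent kobject pointer is NULL afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the parent link matters for this sketch. */
struct fake_kobj {
	struct fake_kobj *parent;
};

struct fake_dev {
	struct fake_kobj kobj;
	int teardown_count;	/* how many times the real teardown ran */
};

/* Models the one effect of device_del() used here: parent becomes NULL. */
static void fake_device_del(struct fake_dev *dev)
{
	dev->kobj.parent = NULL;
}

void destroy_dev(struct fake_dev *dev)
{
	assert(dev != NULL);

	/* Already torn down by an earlier caller: nothing left to do. */
	if (!dev->kobj.parent)
		return;

	fake_device_del(dev);
	dev->teardown_count++;	/* list removal etc. would go here */
}
```

As noted earlier in the thread, a check like this is only reliable if
every caller is serialized by the rescan-remove lock; without that, two
callers could both observe a non-NULL parent.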


^ permalink raw reply	[flat|nested] 69+ messages in thread

* [Update][PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking
  2014-01-10 14:28                                   ` [PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking Rafael J. Wysocki
@ 2014-01-15 13:36                                     ` Rafael J. Wysocki
  2014-01-15 17:38                                       ` Bjorn Helgaas
  0 siblings, 1 reply; 69+ messages in thread
From: Rafael J. Wysocki @ 2014-01-15 13:36 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Subject: powerpc / eeh_driver: Use global PCI rescan-remove locking

Race conditions are theoretically possible between the PCI device
addition and removal in the PPC64 PCI error recovery driver and
the generic PCI bus rescan and device removal that can be triggered
via sysfs.

To avoid those race conditions, make the PPC64 PCI error recovery
driver use global PCI rescan-remove locking around PCI device addition
and removal.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
---

The previous version had wrong function names in the last hunk, sorry about
that.

Rafael

---
 arch/powerpc/kernel/eeh_driver.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

Index: linux-pm/arch/powerpc/kernel/eeh_driver.c
===================================================================
--- linux-pm.orig/arch/powerpc/kernel/eeh_driver.c
+++ linux-pm/arch/powerpc/kernel/eeh_driver.c
@@ -369,7 +369,9 @@ static void *eeh_rmv_device(void *data,
 	edev->mode |= EEH_DEV_DISCONNECTED;
 	(*removed)++;
 
+	pci_lock_rescan_remove();
 	pci_stop_and_remove_bus_device(dev);
+	pci_unlock_rescan_remove();
 
 	return NULL;
 }
@@ -416,10 +418,13 @@ static int eeh_reset_device(struct eeh_p
 	 * into pcibios_add_pci_devices().
 	 */
 	eeh_pe_state_mark(pe, EEH_PE_KEEP);
-	if (bus)
+	if (bus) {
+		pci_lock_rescan_remove();
 		pcibios_remove_pci_devices(bus);
-	else if (frozen_bus)
+		pci_unlock_rescan_remove();
+	} else if (frozen_bus) {
 		eeh_pe_dev_traverse(pe, eeh_rmv_device, &removed);
+	}
 
 	/* Reset the pci controller. (Asserts RST#; resets config space).
 	 * Reconfigure bridges and devices. Don't try to bring the system
@@ -429,6 +434,8 @@ static int eeh_reset_device(struct eeh_p
 	if (rc)
 		return rc;
 
+	pci_lock_rescan_remove();
+
 	/* Restore PE */
 	eeh_ops->configure_bridge(pe);
 	eeh_pe_restore_bars(pe);
@@ -462,6 +469,7 @@ static int eeh_reset_device(struct eeh_p
 	pe->tstamp = tstamp;
 	pe->freeze_count = cnt;
 
+	pci_unlock_rescan_remove();
 	return 0;
 }
 
@@ -618,8 +626,11 @@ perm_error:
 	eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
 
 	/* Shut down the device drivers for good. */
-	if (frozen_bus)
+	if (frozen_bus) {
+		pci_lock_rescan_remove();
 		pcibios_remove_pci_devices(frozen_bus);
+		pci_unlock_rescan_remove();
+	}
 }
 
 static void eeh_handle_special_event(void)
@@ -692,6 +703,7 @@ static void eeh_handle_special_event(voi
 	if (rc == 2 || rc == 1)
 		eeh_handle_normal_event(pe);
 	else {
+		pci_lock_rescan_remove();
 		list_for_each_entry_safe(hose, tmp,
 			&hose_list, list_node) {
 			phb_pe = eeh_phb_pe_get(hose);
@@ -703,6 +715,7 @@ static void eeh_handle_special_event(voi
 			eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
 			pcibios_remove_pci_devices(bus);
 		}
+		pci_unlock_rescan_remove();
 	}
 }
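
One detail of the eeh_reset_device() hunk is easy to miss: the
pcibios_remove_pci_devices() branch is bracketed by the lock, but the
eeh_pe_dev_traverse() branch is not, because eeh_rmv_device() takes the
lock itself for each device.  Assuming the rescan-remove lock is an
ordinary non-recursive mutex, bracketing both would self-deadlock.  The
sketch below models that with a tiny single-threaded lock that reports a
relock as -EDEADLK instead of hanging; all names in it are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Single-threaded model of a non-recursive lock: relocking is reported
 * as -EDEADLK here, where a real mutex would simply hang.
 */
static bool model_lock_held;

static int model_lock(void)
{
	if (model_lock_held)
		return -EDEADLK;
	model_lock_held = true;
	return 0;
}

static void model_unlock(void)
{
	assert(model_lock_held);
	model_lock_held = false;
}

/* Per-device callback that locks internally, like eeh_rmv_device(). */
static int rmv_device(int *removed)
{
	int rc = model_lock();

	if (rc)
		return rc;
	(*removed)++;
	model_unlock();
	return 0;
}

/* Traversal helper, standing in for eeh_pe_dev_traverse(). */
static int traverse(int ndev, int *removed)
{
	int i, rc;

	for (i = 0; i < ndev; i++) {
		rc = rmv_device(removed);
		if (rc)
			return rc;
	}
	return 0;
}

/* Shape used by the patch: the traversal branch is left unbracketed. */
int reset_path(int ndev, int *removed)
{
	return traverse(ndev, removed);
}

/* Hypothetical mistake: bracketing the traversal as well. */
int reset_path_nested(int ndev, int *removed)
{
	int rc = model_lock();

	if (rc)
		return rc;
	rc = traverse(ndev, removed);
	model_unlock();
	return rc;
}
```

The last hunk of the patch takes the opposite approach for the hose_list
loop: the callees there do not lock, so a single lock/unlock pair around
the whole loop is enough.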
 


^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [Update][PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking
  2014-01-15 13:36                                     ` [Update][PATCH " Rafael J. Wysocki
@ 2014-01-15 17:38                                       ` Bjorn Helgaas
  0 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2014-01-15 17:38 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

On Wed, Jan 15, 2014 at 02:36:36PM +0100, Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Subject: powerpc / eeh_driver: Use global PCI rescan-remove locking
> 
> Race conditions are theoretically possible between the PCI device
> addition and removal in the PPC64 PCI error recovery driver and
> the generic PCI bus rescan and device removal that can be triggered
> via sysfs.
> 
> To avoid those race conditions, make the PPC64 PCI error recovery
> driver use global PCI rescan-remove locking around PCI device addition
> and removal.
> 
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> ---
> 
> The previous version had wrong function names in the last hunk, sorry about
> that.

I replaced the previous version and re-pushed the pci/locking branch,
thanks!

> 
> Rafael
> 
> ---
>  arch/powerpc/kernel/eeh_driver.c |   19 ++++++++++++++++---
>  1 file changed, 16 insertions(+), 3 deletions(-)
> 
> Index: linux-pm/arch/powerpc/kernel/eeh_driver.c
> ===================================================================
> --- linux-pm.orig/arch/powerpc/kernel/eeh_driver.c
> +++ linux-pm/arch/powerpc/kernel/eeh_driver.c
> @@ -369,7 +369,9 @@ static void *eeh_rmv_device(void *data,
>  	edev->mode |= EEH_DEV_DISCONNECTED;
>  	(*removed)++;
>  
> +	pci_lock_rescan_remove();
>  	pci_stop_and_remove_bus_device(dev);
> +	pci_unlock_rescan_remove();
>  
>  	return NULL;
>  }
> @@ -416,10 +418,13 @@ static int eeh_reset_device(struct eeh_p
>  	 * into pcibios_add_pci_devices().
>  	 */
>  	eeh_pe_state_mark(pe, EEH_PE_KEEP);
> -	if (bus)
> +	if (bus) {
> +		pci_lock_rescan_remove();
>  		pcibios_remove_pci_devices(bus);
> -	else if (frozen_bus)
> +		pci_unlock_rescan_remove();
> +	} else if (frozen_bus) {
>  		eeh_pe_dev_traverse(pe, eeh_rmv_device, &removed);
> +	}
>  
>  	/* Reset the pci controller. (Asserts RST#; resets config space).
>  	 * Reconfigure bridges and devices. Don't try to bring the system
> @@ -429,6 +434,8 @@ static int eeh_reset_device(struct eeh_p
>  	if (rc)
>  		return rc;
>  
> +	pci_lock_rescan_remove();
> +
>  	/* Restore PE */
>  	eeh_ops->configure_bridge(pe);
>  	eeh_pe_restore_bars(pe);
> @@ -462,6 +469,7 @@ static int eeh_reset_device(struct eeh_p
>  	pe->tstamp = tstamp;
>  	pe->freeze_count = cnt;
>  
> +	pci_unlock_rescan_remove();
>  	return 0;
>  }
>  
> @@ -618,8 +626,11 @@ perm_error:
>  	eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
>  
>  	/* Shut down the device drivers for good. */
> -	if (frozen_bus)
> +	if (frozen_bus) {
> +		pci_lock_rescan_remove();
>  		pcibios_remove_pci_devices(frozen_bus);
> +		pci_unlock_rescan_remove();
> +	}
>  }
>  
>  static void eeh_handle_special_event(void)
> @@ -692,6 +703,7 @@ static void eeh_handle_special_event(voi
>  	if (rc == 2 || rc == 1)
>  		eeh_handle_normal_event(pe);
>  	else {
> +		pci_lock_rescan_remove();
>  		list_for_each_entry_safe(hose, tmp,
>  			&hose_list, list_node) {
>  			phb_pe = eeh_phb_pe_get(hose);
> @@ -703,6 +715,7 @@ static void eeh_handle_special_event(voi
>  			eeh_pe_dev_traverse(pe, eeh_report_failure, NULL);
>  			pcibios_remove_pci_devices(bus);
>  		}
> +		pci_unlock_rescan_remove();
>  	}
>  }
>  
> 

^ permalink raw reply	[flat|nested] 69+ messages in thread

* Re: [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once)
  2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
                                                     ` (8 preceding siblings ...)
  2014-01-10 14:29                                   ` [PATCH 9/9] Xen / PCI: " Rafael J. Wysocki
@ 2014-01-15 18:02                                   ` Bjorn Helgaas
  9 siblings, 0 replies; 69+ messages in thread
From: Bjorn Helgaas @ 2014-01-15 18:02 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Yinghai Lu, Rafael J. Wysocki, Gu Zheng, Guo Chao, linux-pci,
	linux-kernel, Mika Westerberg, Myron Stowe,
	Benjamin Herrenschmidt, linux-scsi, Matthew Garrett,
	Konrad Rzeszutek Wilk

On Fri, Jan 10, 2014 at 03:20:44PM +0100, Rafael J. Wysocki wrote:
> [Cc: adding linux-scsi for the MPT changes, Ben for powerpc, Matthew for
>  platform/x86 and Konrad for Xen]
> 
> On Friday, December 06, 2013 02:21:50 AM Rafael J. Wysocki wrote:
> 
> [...]
> 
> > 
> > OK
> > 
> > To be a bit more constructive, as the next step I'd try to use
> > pci_remove_rescan_mutex to serialize all PCI hotplug operations (as I said
> > above) without making the other changes made by my patch.  Does that sound
> > reasonable?
> 
> Well, no answer here, so as a followup, a series implementing that idea
> follows.
> 
> I *hope* I found all of the places that need to be synchronized vs the bus
> rescan and device removal that can be triggered via sysfs, but I might have
> overlooked something.  Also, in some cases I wasn't quite sure how much stuff
> to put under the lock, because said stuff is not exactly straightforward.

I applied this series to my pci/locking branch for v3.14.  It should appear
in -next tomorrow.

Note that this touches some areas that are not strictly PCI, so speak
up if I'm treading on your toes:

 arch/powerpc/kernel/eeh_driver.c       |   19 ++++++++++++--
 drivers/acpi/pci_root.c                |    6 ++++
 drivers/message/fusion/mptbase.c       |    2 -
 drivers/pci/hotplug/acpiphp.h          |    5 +++
 drivers/pci/hotplug/acpiphp_core.c     |    2 -
 drivers/pci/hotplug/acpiphp_glue.c     |   43 +++++++++++++++++++++++++++++----
 drivers/pci/hotplug/cpci_hotplug_pci.c |   14 +++++++++-
 drivers/pci/hotplug/cpqphp_pci.c       |    8 +++++-
 drivers/pci/hotplug/ibmphp_core.c      |   13 ++++++++-
 drivers/pci/hotplug/pciehp_pci.c       |   17 +++++++++----
 drivers/pci/hotplug/rpadlpar_core.c    |   19 ++++++++++----
 drivers/pci/hotplug/rpaphp_core.c      |    4 +++
 drivers/pci/hotplug/s390_pci_hpc.c     |    4 ++-
 drivers/pci/hotplug/sgi_hotplug.c      |    5 +++
 drivers/pci/hotplug/shpchp_pci.c       |   18 ++++++++++---
 drivers/pci/pci-sysfs.c                |   19 +++++---------
 drivers/pci/probe.c                    |   18 +++++++++++++
 drivers/pci/remove.c                   |   11 ++++++++
 drivers/pci/xen-pcifront.c             |    8 ++++++
 drivers/pcmcia/cardbus.c               |    7 +++++
 drivers/platform/x86/asus-wmi.c        |    2 +
 drivers/platform/x86/eeepc-laptop.c    |    2 +
 drivers/scsi/mpt2sas/mpt2sas_base.c    |    2 -
 drivers/scsi/mpt3sas/mpt3sas_base.c    |    2 -
 include/linux/pci.h                    |    3 ++


^ permalink raw reply	[flat|nested] 69+ messages in thread

end of thread, other threads:[~2014-01-15 18:02 UTC | newest]

Thread overview: 69+ messages
2013-11-26  1:28 [PATCH v2 00/10] PCI: Double removing fix and allocate 64bit mmio pref Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 01/10] PCI: Use device_release_driver in pci_stop_root_bus Yinghai Lu
2013-11-27  1:09   ` Rafael J. Wysocki
2013-11-26  1:28 ` [PATCH v2 02/10] PCI: Move back pci_proc_attach_devices calling Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 03/10] PCI: Move resources and bus_list releasing to pci_release_dev Yinghai Lu
2013-11-27  1:15   ` Rafael J. Wysocki
2013-11-27  2:15     ` Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
2013-11-26  3:38   ` Bjorn Helgaas
2013-11-26 19:34     ` Yinghai Lu
2013-11-26 20:13       ` Yinghai Lu
2013-11-27  1:24         ` Rafael J. Wysocki
2013-11-27  2:26           ` Yinghai Lu
2013-11-29 23:38             ` Rafael J. Wysocki
2013-11-29 23:45               ` Rafael J. Wysocki
2013-11-30  0:31                 ` Rafael J. Wysocki
2013-11-30 21:37                   ` Rafael J. Wysocki
2013-11-30 22:27                     ` Yinghai Lu
2013-12-01  1:24                       ` Rafael J. Wysocki
2013-12-02  1:29                         ` Rafael J. Wysocki
2013-12-02 14:49                           ` Rafael J. Wysocki
2013-12-05 22:40                             ` Bjorn Helgaas
2013-12-06  1:21                               ` Rafael J. Wysocki
2013-12-06  6:29                                 ` Yinghai Lu
2014-01-10 14:20                                 ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
2014-01-10 14:22                                   ` [PATCH 1/9] PCI: Global rescan-remove lock Rafael J. Wysocki
2014-01-10 14:23                                   ` [PATCH 2/9] ACPI / PCI: Use global PCI rescan-remove locking in PCI root hotplug Rafael J. Wysocki
2014-01-10 14:24                                   ` [PATCH 3/9] ACPI / hotplug / PCI: Use global PCI rescan-remove locking Rafael J. Wysocki
2014-01-10 14:25                                   ` [PATCH 4/9] PCMCIA / cardbus: " Rafael J. Wysocki
2014-01-10 14:26                                   ` [PATCH 5/9] PCI / hotplug: " Rafael J. Wysocki
2014-01-10 14:27                                   ` [PATCH 6/9] platform / x86: " Rafael J. Wysocki
2014-01-10 14:27                                   ` [PATCH 7/9] MPT / PCI: Use pci_stop_and_remove_bus_device_locked() Rafael J. Wysocki
2014-01-10 14:28                                   ` [PATCH 8/9] powerpc / eeh_driver: Use global PCI rescan-remove locking Rafael J. Wysocki
2014-01-15 13:36                                     ` [Update][PATCH " Rafael J. Wysocki
2014-01-15 17:38                                       ` Bjorn Helgaas
2014-01-10 14:29                                   ` [PATCH 9/9] Xen / PCI: " Rafael J. Wysocki
2014-01-15 18:02                                   ` [PATCH 0/9] PCI: Eliminate race conditions between hotplug and sysfs rescan/remove (Was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Bjorn Helgaas
2013-12-06  6:52                             ` [PATCH v2 04/10] PCI: Destroy pci dev only once Yinghai Lu
2013-12-07  1:27                               ` Rafael J. Wysocki
2013-12-08  3:31                                 ` Yinghai Lu
2013-12-08  3:50                                   ` Greg Kroah-Hartman
2013-12-09 15:24                                     ` Ethan Zhao
2013-12-09 19:08                                       ` Greg Kroah-Hartman
2013-12-10  7:43                                         ` Ethan Zhao
2014-01-13  1:03                                 ` [PATCH] PCI / remove: Check parent kobject in pci_destroy_dev() (was: Re: [PATCH v2 04/10] PCI: Destroy pci dev only once) Rafael J. Wysocki
2013-11-27  1:17       ` [PATCH v2 04/10] PCI: Destroy pci dev only once Rafael J. Wysocki
2013-11-26  1:28 ` [PATCH v2 05/10] PCI: pcibus address to resource converting take bus directly Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 06/10] PCI: Add pcibios_bus_addr_to_res() Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 07/10] PCI: Try to allocate mem64 above 4G at first Yinghai Lu
2013-11-26  4:15   ` Bjorn Helgaas
2013-11-26 20:14     ` Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 08/10] PCI: Try best to allocate pref mmio 64bit above 4g Yinghai Lu
2013-11-26  4:17   ` Bjorn Helgaas
2013-11-26  6:59     ` Guo Chao
2013-11-26 17:53       ` Bjorn Helgaas
2013-11-26 22:00         ` Yinghai Lu
2013-11-26 22:01           ` Bjorn Helgaas
2013-11-27  0:33             ` Yinghai Lu
2013-11-26  1:28 ` [PATCH v2 09/10] PCI: Sort pci root bus resources list Yinghai Lu
2013-11-26  4:18   ` Bjorn Helgaas
2013-11-26  1:28 ` [PATCH v2 10/10] intel-gtt: Read 64bit for gmar_bus_addr Yinghai Lu
2013-11-26  3:46   ` Bjorn Helgaas
2013-11-26 19:35     ` Yinghai Lu
2013-12-11 18:48       ` Bjorn Helgaas
2013-12-11 19:58         ` Yinghai Lu
2013-12-21  0:27   ` Bjorn Helgaas
2013-12-21  1:19     ` Yinghai Lu
2013-12-21 18:50       ` Bjorn Helgaas
2013-12-23 22:33         ` Bjorn Helgaas
