linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] PCI: pcie hotplug related patch
@ 2012-01-21  9:52 Yinghai Lu
  2012-01-21  9:52 ` [PATCH 1/7] PCI: Make sriov work with hotplug remove Yinghai Lu
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

b40c9ef: pciehp: Disable/enable link during slot power off/on
fe877ae: pciehp: Add Disable/enable link functions
292e286: pciehp: Add pcie_wait_link_not_active()
186b3fd: pciehp: print out link status when dlla get active.
2781df4: pciehp: Checking pci conf reading to new added device instead of sleep 1s
bf1c4ac: PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device
601c6f1: PCI: Make sriov work with hotplug remove

 drivers/pci/hotplug/pciehp_hpc.c |  133 +++++++++++++++++++++++++++++++-------
 drivers/pci/pci.h                |    2 +
 drivers/pci/probe.c              |   48 +++++++++-----
 drivers/pci/remove.c             |   28 ++++++++-
 4 files changed, 169 insertions(+), 42 deletions(-)

First one is for hotplug with sriov under bridge.
following two are for hotplug probing with pci conf reading.
Last four are for pcie link disable when power off slots.

Thanks

Yinghai Lu

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-23 16:06   ` Linus Torvalds
  2012-01-21  9:52 ` [PATCH 2/7] PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device Yinghai Lu
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

When hot remove pci express module that have pcie switch and support SRIOV, got

[ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1
[ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt received
[ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3)
[ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 1f9
[ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due to button press.
[ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10
[ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
[ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
[ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling domain:bus:device=0000:b0:00
[ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 2f9
[ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device: domain:bus:dev = 0000:b0:00
[ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6
[ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14
[ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15
[ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16
[ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled
[ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83
[ 5926.133377] PGD 0
[ 5926.135402] Oops: 0000 [#1] SMP
[ 5926.138659] CPU 2
[ 5926.140499] Modules linked in:
...
[ 5926.143754]
[ 5926.275823] Call Trace:
[ 5926.278267]  [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83
[ 5926.284180]  [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba
[ 5926.290264]  [<ffffffff81366311>] pciehp_unconfigure_device+0x110/0x17b
[ 5926.296866]  [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188
[ 5926.303123]  [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188
[ 5926.309206]  [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0
...

 +-[0000:80]-+-00.0-[81-8f]--
 |           +-01.0-[90-9f]--
 |           +-02.0-[a0-af]--
 |           +-02.2-[b0-bf]----00.0-[b1-b3]--+-02.0-[b2]--+-00.0 Device
 |           |                               |            +-00.1 Device
 |           |                               |            +-00.2 Device
 |           |                               |            \-00.3 Device
 |           |                               \-03.0-[b3]--+-00.0 Device
 |           |                                            +-00.1 Device
 |           |                                            +-00.2 Device
 |           |                                            \-00.3 Device

root complex: 80:02.2
pci express modules: have pcie switch and are listed as b0:00.0, b1:02.0 and b1:03.0.
                end devices  are b2:00.0 and b3.00.0.
                VFs are: b2:00.1,... b2:00.3, and b3:00.1,...,b3:00.3

Root cause: when doing pci_stop_bus_device() with phys fn, it will stop virt fn and
remove the fn, so
	list_for_each_safe(l, n, &bus->devices)
will have problem to refer freed n that is pointed to vf entry.

Solution is just replacing list_for_each_safe() with list_for_each().
	pci_stop_bus_device(dev) will not remove dev from bus->devices list,
	so We only need use for_each here.

During reviewing the patch, Bjorn said:
|   The PCI hot-remove path calls pci_stop_bus_devices() via
|   pci_remove_bus_device().
|
|   pci_stop_bus_devices() traverses the bus->devices list (point A below),
|   stopping each device in turn, which calls the driver remove() method.  When
|   the device is an SR-IOV PF, the driver calls pci_disable_sriov(), which
|   also uses pci_remove_bus_device() to remove the VF devices from the
|   bus->devices list (point B).
|
|       pci_remove_bus_device
|         pci_stop_bus_device
|           pci_stop_bus_devices(subordinate)
|             list_for_each(bus->devices)             <-- A
|               pci_stop_bus_device(PF)
|                 ...
|                   driver->remove
|                     pci_disable_sriov
|                       ...
|                         pci_remove_bus_device(VF)
|                             <remove from bus_list>  <-- B
|
|   At B, we're changing the same list we're iterating through at A, so when
|   the driver remove() method returns, the pci_stop_bus_devices() iterator has
|   a pointer to a list entry that has already been freed.

Discussion thread can be found : https://lkml.org/lkml/2011/10/15/141

-v2: updated changelog.
-v3: According to Kenji, do not allocate another list for PF.
     Also We don't need to check dev->is_virtfn anymore, because PF should come
	first in the list, and even VF come first, it is still ok.
-v4: According to Kenji, expand list_for_each(), and add is_virtfn checking

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/remove.c |   28 +++++++++++++++++++++++++++-
 1 files changed, 27 insertions(+), 1 deletions(-)

diff --git a/drivers/pci/remove.c b/drivers/pci/remove.c
index 7f87bee..932c5e2 100644
--- a/drivers/pci/remove.c
+++ b/drivers/pci/remove.c
@@ -123,10 +123,36 @@ void pci_remove_behind_bridge(struct pci_dev *dev)
 static void pci_stop_bus_devices(struct pci_bus *bus)
 {
 	struct list_head *l, *n;
+	struct list_head *head = &bus->devices;
 
+	/*
+	 * pci_stop_bus_device(dev) will not remove dev from bus->devices list,
+	 *  so We don't need use _safe version for_each here.
+	 * Also _safe version has problem when pci_stop_bus_device() for PF try
+	 *  to remove VFs.
+	 */
+	for (l = head->next; l != head;) {
+		struct pci_dev *dev = pci_dev_b(l);
+
+		/*
+		 * VFs are removed by pci_remove_bus_device() in the
+		 *  pci_stop_bus_devices() code path for PF.
+		 *  aka, bus->devices get updated in the process.
+		 * barrier() will make sure we get right next from that list.
+		 */
+		if (!dev->is_virtfn) {
+			pci_stop_bus_device(dev);
+			barrier();
+		}
+		l = l->next;
+	}
+
+	/* For left over VFs in case */
 	list_for_each_safe(l, n, &bus->devices) {
 		struct pci_dev *dev = pci_dev_b(l);
-		pci_stop_bus_device(dev);
+
+		if (dev->is_virtfn)
+			pci_stop_bus_device(dev);
 	}
 }
 
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/7] PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
  2012-01-21  9:52 ` [PATCH 1/7] PCI: Make sriov work with hotplug remove Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-21  9:52 ` [PATCH 3/7] pciehp: Checking pci conf reading to new added device instead of sleep 1s Yinghai Lu
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

Will reuse that function for pciehp probing.

-v2: according to Kenji, fix crs timeout checking, and export the function
     for later using when pciehp is compiled as module.

Suggested-by: Matthew Wilcox <matthew@wil.cx>
Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/pci.h   |    2 ++
 drivers/pci/probe.c |   48 +++++++++++++++++++++++++++++++-----------------
 2 files changed, 33 insertions(+), 17 deletions(-)

diff --git a/drivers/pci/pci.h b/drivers/pci/pci.h
index b74084e..8dadbd3 100644
--- a/drivers/pci/pci.h
+++ b/drivers/pci/pci.h
@@ -205,6 +205,8 @@ enum pci_bar_type {
 	pci_bar_mem64,		/* A 64-bit memory BAR */
 };
 
+bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *pl,
+				int crs_timeout);
 extern int pci_setup_device(struct pci_dev *dev);
 extern int __pci_read_base(struct pci_dev *dev, enum pci_bar_type type,
 				struct resource *res, unsigned int reg);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 04e74f4..101e90f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1113,40 +1113,54 @@ struct pci_dev *alloc_pci_dev(void)
 }
 EXPORT_SYMBOL(alloc_pci_dev);
 
-/*
- * Read the config data for a PCI device, sanity-check it
- * and fill in the dev structure...
- */
-static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
+bool pci_bus_read_dev_vendor_id(struct pci_bus *bus, int devfn, u32 *l,
+				 int crs_timeout)
 {
-	struct pci_dev *dev;
-	u32 l;
 	int delay = 1;
 
-	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, &l))
-		return NULL;
+	if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
+		return false;
 
 	/* some broken boards return 0 or ~0 if a slot is empty: */
-	if (l == 0xffffffff || l == 0x00000000 ||
-	    l == 0x0000ffff || l == 0xffff0000)
-		return NULL;
+	if (*l == 0xffffffff || *l == 0x00000000 ||
+	    *l == 0x0000ffff || *l == 0xffff0000)
+		return false;
 
 	/* Configuration request Retry Status */
-	while (l == 0xffff0001) {
+	while (*l == 0xffff0001) {
+		if (!crs_timeout)
+			return false;
+
 		msleep(delay);
 		delay *= 2;
-		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, &l))
-			return NULL;
+		if (pci_bus_read_config_dword(bus, devfn, PCI_VENDOR_ID, l))
+			return false;
 		/* Card hasn't responded in 60 seconds?  Must be stuck. */
-		if (delay > 60 * 1000) {
+		if (delay > crs_timeout) {
 			printk(KERN_WARNING "pci %04x:%02x:%02x.%d: not "
 					"responding\n", pci_domain_nr(bus),
 					bus->number, PCI_SLOT(devfn),
 					PCI_FUNC(devfn));
-			return NULL;
+			return false;
 		}
 	}
 
+	return true;
+}
+EXPORT_SYMBOL(pci_bus_read_dev_vendor_id);
+
+/*
+ * Read the config data for a PCI device, sanity-check it
+ * and fill in the dev structure...
+ */
+static struct pci_dev *pci_scan_device(struct pci_bus *bus, int devfn)
+{
+	struct pci_dev *dev;
+	u32 l;
+
+	if (!pci_bus_read_dev_vendor_id(bus, devfn, &l, 60*1000))
+		return NULL;
+
 	dev = alloc_pci_dev();
 	if (!dev)
 		return NULL;
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/7] pciehp: Checking pci conf reading to new added device instead of sleep 1s
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
  2012-01-21  9:52 ` [PATCH 1/7] PCI: Make sriov work with hotplug remove Yinghai Lu
  2012-01-21  9:52 ` [PATCH 2/7] PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-21  9:52 ` [PATCH 4/7] pciehp: print out link status when dlla get active Yinghai Lu
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

During reviewing
|	PCI: pciehp: wait 1000 ms before Link Training check
Linus said:
>...
> That's a *long* time, and it's irritating to the user. It makes the
> user think "the machine is slow".
>...
> And quite frankly, an unconditional one-second delay here seems bad.
>Two seconds was unacceptable, one second is just bad.

Try to access the pci conf of pci device that is supposed to show up in 1s,
if could read back valid vender/device id, could bail out early.

Related discussion could be found:
	https://lkml.org/lkml/2011/12/6/339

-v2: seperate code to pci_bus_read_dev_vendor_id() from pci_scan_device()
    and reuse it from pciehp code. Suggested by Matthew Wilcox.
-v3: According to Kenj, don't use array in stack, and don't wait too long
    for crs, also return fail status if not found.
    Also separate pci_bus_dev_read_vendor_id() change to another patch.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/hotplug/pciehp_hpc.c |   49 ++++++++++++++++++++++++++-----------
 1 files changed, 34 insertions(+), 15 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index 7b14148..ad8c1be 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -265,10 +265,37 @@ static void pcie_wait_link_active(struct controller *ctrl)
 	ctrl_dbg(ctrl, "Data Link Layer Link Active not set in 1000 msec\n");
 }
 
+static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
+{
+	u32 l;
+	int count = 0;
+	int delay = 1000, step = 20;
+	bool found = false;
+
+	do {
+		found = pci_bus_read_dev_vendor_id(bus, devfn, &l, 0);
+		count++;
+
+		if (found)
+			break;
+
+		msleep(step);
+		delay -= step;
+	} while (delay > 0);
+
+	if (count > 1 && pciehp_debug)
+		printk(KERN_DEBUG "pci %04x:%02x:%02x.%d id reading try %d times with interval %d ms to get %08x\n",
+			pci_domain_nr(bus), bus->number, PCI_SLOT(devfn),
+			PCI_FUNC(devfn), count, step, l);
+
+	return found;
+}
+
 int pciehp_check_link_status(struct controller *ctrl)
 {
 	u16 lnk_status;
 	int retval = 0;
+	bool found = false;
 
         /*
          * Data Link Layer Link Active Reporting must be capable for
@@ -280,13 +307,10 @@ int pciehp_check_link_status(struct controller *ctrl)
         else
                 msleep(1000);
 
-	/*
-	 * Need to wait for 1000 ms after Data Link Layer Link Active
-	 * (DLLLA) bit reads 1b before sending configuration request.
-	 * We need it before checking Link Training (LT) bit becuase
-	 * LT is still set even after DLLLA bit is set on some platform.
-	 */
-	msleep(1000);
+	/* wait 100ms before read pci conf, and try in 1s */
+	msleep(100);
+	found = pci_bus_check_dev(ctrl->pcie->port->subordinate,
+					PCI_DEVFN(0, 0));
 
 	retval = pciehp_readw(ctrl, PCI_EXP_LNKSTA, &lnk_status);
 	if (retval) {
@@ -302,16 +326,11 @@ int pciehp_check_link_status(struct controller *ctrl)
 		return retval;
 	}
 
-	/*
-	 * If the port supports Link speeds greater than 5.0 GT/s, we
-	 * must wait for 100 ms after Link training completes before
-	 * sending configuration request.
-	 */
-	if (ctrl->pcie->port->subordinate->max_bus_speed > PCIE_SPEED_5_0GT)
-		msleep(100);
-
 	pcie_update_link_speed(ctrl->pcie->port->subordinate, lnk_status);
 
+	if (!found && !retval)
+		retval = -1;
+
 	return retval;
 }
 
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 4/7] pciehp: print out link status when dlla get active.
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
                   ` (2 preceding siblings ...)
  2012-01-21  9:52 ` [PATCH 3/7] pciehp: Checking pci conf reading to new added device instead of sleep 1s Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-21  9:52 ` [PATCH 5/7] pciehp: Add pcie_wait_link_not_active() Yinghai Lu
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

Use it get timestamp to calculate time for the pci conf access readness.

and let it return bool instead of int.

also remove inline attribute, let compiler decide it.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>
---
 drivers/pci/hotplug/pciehp_hpc.c |   17 ++++++++++++-----
 1 files changed, 12 insertions(+), 5 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index ad8c1be..a1a9879 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -241,13 +241,20 @@ static int pcie_write_cmd(struct controller *ctrl, u16 cmd, u16 mask)
 	return retval;
 }
 
-static inline int check_link_active(struct controller *ctrl)
+static bool check_link_active(struct controller *ctrl)
 {
-	u16 link_status;
+	bool ret = false;
+	u16 lnk_status;
 
-	if (pciehp_readw(ctrl, PCI_EXP_LNKSTA, &link_status))
-		return 0;
-	return !!(link_status & PCI_EXP_LNKSTA_DLLLA);
+	if (pciehp_readw(ctrl, PCI_EXP_LNKSTA, &lnk_status))
+		return ret;
+
+	ret = !!(lnk_status & PCI_EXP_LNKSTA_DLLLA);
+
+	if (ret)
+		ctrl_dbg(ctrl, "%s: lnk_status = %x\n", __func__, lnk_status);
+
+	return ret;
 }
 
 static void pcie_wait_link_active(struct controller *ctrl)
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 5/7] pciehp: Add pcie_wait_link_not_active()
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
                   ` (3 preceding siblings ...)
  2012-01-21  9:52 ` [PATCH 4/7] pciehp: print out link status when dlla get active Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-21  9:52 ` [PATCH 6/7] pciehp: Add Disable/enable link functions Yinghai Lu
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu, Yinghai Lu

Will use for link disable status checking

Signed-off-by: Yinghai Lu <yinghai.lu@oracle.com>
---
 drivers/pci/hotplug/pciehp_hpc.c |   19 +++++++++++++++----
 1 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index a1a9879..ea54aef 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -257,19 +257,30 @@ static bool check_link_active(struct controller *ctrl)
 	return ret;
 }
 
-static void pcie_wait_link_active(struct controller *ctrl)
+static void __pcie_wait_link_active(struct controller *ctrl, bool active)
 {
 	int timeout = 1000;
 
-	if (check_link_active(ctrl))
+	if (check_link_active(ctrl) == active)
 		return;
 	while (timeout > 0) {
 		msleep(10);
 		timeout -= 10;
-		if (check_link_active(ctrl))
+		if (check_link_active(ctrl) == active)
 			return;
 	}
-	ctrl_dbg(ctrl, "Data Link Layer Link Active not set in 1000 msec\n");
+	ctrl_dbg(ctrl, "Data Link Layer Link Active not %s in 1000 msec\n",
+			active ? "set" : "cleared");
+}
+
+static void pcie_wait_link_active(struct controller *ctrl)
+{
+	__pcie_wait_link_active(ctrl, true);
+}
+
+static void pcie_wait_link_not_active(struct controller *ctrl)
+{
+	__pcie_wait_link_active(ctrl, false);
 }
 
 static bool pci_bus_check_dev(struct pci_bus *bus, int devfn)
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 6/7] pciehp: Add Disable/enable link functions
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
                   ` (4 preceding siblings ...)
  2012-01-21  9:52 ` [PATCH 5/7] pciehp: Add pcie_wait_link_not_active() Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-23 16:13   ` Linus Torvalds
  2012-01-21  9:52 ` [PATCH 7/7] pciehp: Disable/enable link during slot power off/on Yinghai Lu
  2012-01-21 10:26 ` [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
  7 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu, Yinghai Lu

Will use it during power off/on slots

Signed-off-by: Yinghai Lu <yinghai.lu@oracle.com>
---
 drivers/pci/hotplug/pciehp_hpc.c |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index ea54aef..fdab8a6 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -352,6 +352,42 @@ int pciehp_check_link_status(struct controller *ctrl)
 	return retval;
 }
 
+static int __pciehp_link_set(struct controller *ctrl, bool enable)
+{
+	u16 lnk_ctrl;
+	int retval = 0;
+
+	retval = pciehp_readw(ctrl, PCI_EXP_LNKCTL, &lnk_ctrl);
+	if (retval) {
+		ctrl_err(ctrl, "Cannot read LNKCTRL register\n");
+		return retval;
+	}
+
+	if (enable)
+		lnk_ctrl &= ~PCI_EXP_LNKCTL_LD;
+	else
+		lnk_ctrl |= PCI_EXP_LNKCTL_LD;
+
+	retval = pciehp_writew(ctrl, PCI_EXP_LNKCTL, lnk_ctrl);
+	if (retval) {
+		ctrl_err(ctrl, "Cannot write LNKCTRL register\n");
+		return retval;
+	}
+	ctrl_dbg(ctrl, "%s: lnk_ctrl = %x\n", __func__, lnk_ctrl);
+
+	return retval;
+}
+
+static int pciehp_link_enable(struct controller *ctrl)
+{
+	return __pciehp_link_set(ctrl, true);
+}
+
+static int pciehp_link_disable(struct controller *ctrl)
+{
+	return __pciehp_link_set(ctrl, false);
+}
+
 int pciehp_get_attention_status(struct slot *slot, u8 *status)
 {
 	struct controller *ctrl = slot->ctrl;
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 7/7] pciehp: Disable/enable link during slot power off/on
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
                   ` (5 preceding siblings ...)
  2012-01-21  9:52 ` [PATCH 6/7] pciehp: Add Disable/enable link functions Yinghai Lu
@ 2012-01-21  9:52 ` Yinghai Lu
  2012-01-21 10:26 ` [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
  7 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21  9:52 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu, Yinghai Lu

One system have repeater in system board to support gen2 hotplug.

Found when EM is removed from some slots, /var/log/message will be full of
"card present/not present" warning.

It turns out root complex still try to train the link to repeater because
repeater is not reset.

This patch will disable link to make repeater could reset properly.
Also could kill AER during EM removal.

Recently when testing hotplug on one system under development, found if boot
the system without EM, and later hotplug does not work with Linux.
But other OS is ok.
The root cause is that bios left link disabled when slot is empty,
and other OS is playing link disable bit in link ctrl during power on/off.

So We could do the same thing to disable/enable link during power off/on.

-v2: check link DLLA bit instead of 100ms waiting.
     Separate link disable/enable functions to another patch.

Signed-off-by: Yinghai Lu <yinghai.lu@oracle.com>
---
 drivers/pci/hotplug/pciehp_hpc.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/hotplug/pciehp_hpc.c b/drivers/pci/hotplug/pciehp_hpc.c
index fdab8a6..0e35d70 100644
--- a/drivers/pci/hotplug/pciehp_hpc.c
+++ b/drivers/pci/hotplug/pciehp_hpc.c
@@ -606,6 +606,10 @@ int pciehp_power_on_slot(struct slot * slot)
 	ctrl_dbg(ctrl, "%s: SLOTCTRL %x write cmd %x\n", __func__,
 		 pci_pcie_cap(ctrl->pcie->port) + PCI_EXP_SLTCTL, slot_cmd);
 
+	retval = pciehp_link_enable(ctrl);
+	if (retval)
+		ctrl_err(ctrl, "%s: Can not enable the link!\n", __func__);
+
 	return retval;
 }
 
@@ -616,6 +620,14 @@ int pciehp_power_off_slot(struct slot * slot)
 	u16 cmd_mask;
 	int retval;
 
+	/* Disable the link at first */
+	pciehp_link_disable(ctrl);
+	/* wait the link is down */
+	if (ctrl->link_active_reporting)
+		pcie_wait_link_not_active(ctrl);
+	else
+		msleep(1000);
+
 	slot_cmd = POWER_OFF;
 	cmd_mask = PCI_EXP_SLTCTL_PCC;
 	retval = pcie_write_cmd(ctrl, slot_cmd, cmd_mask);
-- 
1.7.7


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] PCI: pcie hotplug related patch
  2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
                   ` (6 preceding siblings ...)
  2012-01-21  9:52 ` [PATCH 7/7] pciehp: Disable/enable link during slot power off/on Yinghai Lu
@ 2012-01-21 10:26 ` Yinghai Lu
  2012-01-27 18:36   ` Jesse Barnes
  7 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-21 10:26 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> b40c9ef: pciehp: Disable/enable link during slot power off/on
> fe877ae: pciehp: Add Disable/enable link functions
> 292e286: pciehp: Add pcie_wait_link_not_active()
> 186b3fd: pciehp: print out link status when dlla get active.
> 2781df4: pciehp: Checking pci conf reading to new added device instead of sleep 1s
> bf1c4ac: PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device
> 601c6f1: PCI: Make sriov work with hotplug remove
>
>  drivers/pci/hotplug/pciehp_hpc.c |  133 +++++++++++++++++++++++++++++++-------
>  drivers/pci/pci.h                |    2 +
>  drivers/pci/probe.c              |   48 +++++++++-----
>  drivers/pci/remove.c             |   28 ++++++++-
>  4 files changed, 169 insertions(+), 42 deletions(-)
>
> First one is for hotplug with sriov under bridge.
> following two are for hotplug probing with pci conf reading.
> Last four are for pcie link disable when power off slots.

could be found at:

	git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-pci-hp

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-21  9:52 ` [PATCH 1/7] PCI: Make sriov work with hotplug remove Yinghai Lu
@ 2012-01-23 16:06   ` Linus Torvalds
  2012-01-23 18:30     ` Yinghai Lu
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 16:06 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> +       /*
> +        * pci_stop_bus_device(dev) will not remove dev from bus->devices list,
> +        *  so We don't need use _safe version for_each here.
> +        * Also _safe version has problem when pci_stop_bus_device() for PF try
> +        *  to remove VFs.
> +        */
> +       for (l = head->next; l != head;) {

That's crazy. Why would you open-code this? Why isn't it just a
"list_for_each()"?

And what are the problems with the safe version? If the safe version
doesn't work, then something is *seriously* wrong with the list.

> +               struct pci_dev *dev = pci_dev_b(l);
> +
> +               /*
> +                * VFs are removed by pci_remove_bus_device() in the
> +                *  pci_stop_bus_devices() code path for PF.
> +                *  aka, bus->devices get updated in the process.
> +                * barrier() will make sure we get right next from that list.
> +                */
> +               if (!dev->is_virtfn) {
> +                       pci_stop_bus_device(dev);
> +                       barrier();
> +               }

And this is just insanity. The "barrier()" cannot *possibly* do
anything sane. If it really makes a difference, there is again some
serious problem with the whole f*cking thing.

NAK on the patch until sanity is restored. This is just total voodoo
programming.

                            Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/7] pciehp: Add Disable/enable link functions
  2012-01-21  9:52 ` [PATCH 6/7] pciehp: Add Disable/enable link functions Yinghai Lu
@ 2012-01-23 16:13   ` Linus Torvalds
  2012-01-24  5:36     ` Yinghai Lu
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 16:13 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Yinghai Lu

On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> +
> +       retval = pciehp_readw(ctrl, PCI_EXP_LNKCTL, &lnk_ctrl);
> +       if (retval) {
> +               ctrl_err(ctrl, "Cannot read LNKCTRL register\n");
> +               return retval;
> +       }

Is there really any point at all in checking the return value of that
readw() function?

Nobody actually ever sets it. We should stop even bothering to pretend
to care. There is no return value in real life.

If the device doesn't exist or the read fails, you will never ever get
an error *anyway*. You'll likely get 0xffff, and possibly a machine
check exception. And that is unlikely to ever change - even if the
hardware were to do it, it's insane to do error checking on an
individual IO level. It causes more problems than it could possibly
ever solve - error checking has to be done at a different level than
on some random individual access.

We should make the return type of pci_read_config_word() be 'void',
instead of having code like this that looks like it is careful, but in
actual fact in reality is just totally pointless.

                          Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 16:06   ` Linus Torvalds
@ 2012-01-23 18:30     ` Yinghai Lu
  2012-01-23 18:45       ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 18:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2352 bytes --]

On Mon, Jan 23, 2012 at 8:06 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>> +       /*
>> +        * pci_stop_bus_device(dev) will not remove dev from bus->devices list,
>> +        *  so We don't need use _safe version for_each here.
>> +        * Also _safe version has problem when pci_stop_bus_device() for PF try
>> +        *  to remove VFs.
>> +        */
>> +       for (l = head->next; l != head;) {
>
> That's crazy. Why would you open-code this? Why isn't it just a
> "list_for_each()"?

I have previous version used list_for_each(), but Kenji thought we
should open version because it could be clear that l is updated in the
loop.

>
> And what are the problems with the safe version? If the safe version
> doesn't work, then something is *seriously* wrong with the list.

in list_for_each_safe()

#define list_for_each_safe(pos, n, head) \
        for (pos = (head)->next, n = pos->next; pos != (head); \
                pos = n, n = pos->next)

n is saved before, and safe only mean pos could be freed from the
list, but n still can be used for next loop.

in our case, the list have PF and several VFs, when
pci_stop_bus_device() is called for PF, pos are still valid, but
VFs are removed from the list. so n will not be valid.

>
>> +               struct pci_dev *dev = pci_dev_b(l);
>> +
>> +               /*
>> +                * VFs are removed by pci_remove_bus_device() in the
>> +                *  pci_stop_bus_devices() code path for PF.
>> +                *  aka, bus->devices get updated in the process.
>> +                * barrier() will make sure we get right next from that list.
>> +                */
>> +               if (!dev->is_virtfn) {
>> +                       pci_stop_bus_device(dev);
>> +                       barrier();
>> +               }
>
> And this is just insanity. The "barrier()" cannot *possibly* do
> anything sane. If it really makes a difference, there is again some
> serious problem with the whole f*cking thing.
>
> NAK on the patch until sanity is restored. This is just total voodoo
> programming.

Sorry for that.

Can you please check V1 version ?

https://lkml.org/lkml/2011/10/15/141
or from attached one.

Thanks

Yinghai

[-- Attachment #2: pci_001_debug_sriov_hot_remove.patch --]
[-- Type: text/x-patch, Size: 6117 bytes --]

From: Yinghai Lu <yinghai@kerne.org>
Subject: [PATCH 01/10] PCI: Make sriov work with hotplug remove

When hot remove pci express module that have pcie switch and support SRIOV, got

[ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1
[ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt received
[ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3)
[ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 1f9
[ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due to button press.
[ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10
[ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
[ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
[ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling domain:bus:device=0000:b0:00
[ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 2f9
[ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device: domain:bus:dev = 0000:b0:00
[ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6
[ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14
[ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15
[ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16
[ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled
[ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83
[ 5926.133377] PGD 0
[ 5926.135402] Oops: 0000 [#1] SMP
[ 5926.138659] CPU 2
[ 5926.140499] Modules linked in:
...
[ 5926.143754]
[ 5926.275823] Call Trace:
[ 5926.278267]  [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83
[ 5926.284180]  [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba
[ 5926.290264]  [<ffffffff81366311>] pciehp_unconfigure_device+0x110/0x17b
[ 5926.296866]  [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188
[ 5926.303123]  [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188
[ 5926.309206]  [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0
...

 +-[0000:80]-+-00.0-[81-8f]--
 |           +-01.0-[90-9f]--
 |           +-02.0-[a0-af]--
 |           +-02.2-[b0-bf]----00.0-[b1-b3]--+-02.0-[b2]--+-00.0 Device
 |           |                               |            +-00.1 Device
 |           |                               |            +-00.2 Device
 |           |                               |            \-00.3 Device
 |           |                               \-03.0-[b3]--+-00.0 Device
 |           |                                            +-00.1 Device
 |           |                                            +-00.2 Device
 |           |                                            \-00.3 Device

root complex: 80:02.2
pci express modules: have pcie switch and are listed as b0:00.0, b1:02.0 and b1:03.0.
                end devices  are b2:00.0 and b3.00.0.
                VFs are: b2:00.1,... b2:00.3, and b3:00.1,...,b3:00.3

Root cause: when doing pci_stop_bus_device() with phys fn, it will stop virt fn and
remove the fn, so
	list_for_each_safe(l, n, &bus->devices)
will have problem to refer freed n that is pointed to vf entry.

Solution is just call pci_stop_bus_device() with phys fn only. and before that need to
save phys fn aside and avoid to use bus->devices to loop.

During reviewing the patch, Bjorn said:
|   The PCI hot-remove path calls pci_stop_bus_devices() via
|   pci_remove_bus_device().
|
|   pci_stop_bus_devices() traverses the bus->devices list (point A below),
|   stopping each device in turn, which calls the driver remove() method.  When
|   the device is an SR-IOV PF, the driver calls pci_disable_sriov(), which
|   also uses pci_remove_bus_device() to remove the VF devices from the
|   bus->devices list (point B).
|
|       pci_remove_bus_device
|         pci_stop_bus_device
|           pci_stop_bus_devices(subordinate)
|             list_for_each(bus->devices)             <-- A
|               pci_stop_bus_device(PF)
|                 ...
|                   driver->remove
|                     pci_disable_sriov
|                       ...
|                         pci_remove_bus_device(VF)
|                             <remove from bus_list>  <-- B
|
|   At B, we're changing the same list we're iterating through at A, so when
|   the driver remove() method returns, the pci_stop_bus_devices() iterator has
|   a pointer to a list entry that has already been freed.
|
|   This patch avoids the problem by building a separate list of all PFs on
|   the bus and traversing that at A instead of the bus->devices list.

Discussion thread can be found : https://lkml.org/lkml/2011/10/15/141

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 drivers/pci/remove.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -120,10 +120,43 @@ void pci_remove_behind_bridge(struct pci
 			pci_remove_bus_device(pci_dev_b(l));
 }
 
+struct dev_list {
+	struct pci_dev *dev;
+	struct list_head list;
+};
+
 static void pci_stop_bus_devices(struct pci_bus *bus)
 {
 	struct list_head *l, *n;
+	struct dev_list *dl, *dn;
+	LIST_HEAD(physfn_list);
+
+	/* Save phys_fn aside at first */
+	list_for_each(l, &bus->devices) {
+		struct pci_dev *dev = pci_dev_b(l);
+
+		if (!dev->is_virtfn) {
+			dl = kmalloc(sizeof(*dl), GFP_KERNEL);
+			if (!dl)
+				continue;
+			dl->dev = dev;
+			list_add_tail(&dl->list, &physfn_list);
+		}
+	}
+
+	/*
+	 * stop bus device for phys_fn at first
+	 *  it will stop and remove vf in driver remove action
+	 */
+	list_for_each_entry_safe(dl, dn, &physfn_list, list) {
+		struct pci_dev *dev = dl->dev;
+
+		pci_stop_bus_device(dev);
+
+		kfree(dl);
+	}
 
+	/* Do it again for left over in case */
 	list_for_each_safe(l, n, &bus->devices) {
 		struct pci_dev *dev = pci_dev_b(l);
 		pci_stop_bus_device(dev);

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 18:30     ` Yinghai Lu
@ 2012-01-23 18:45       ` Linus Torvalds
  2012-01-23 19:34         ` Linus Torvalds
  2012-01-23 19:36         ` Yinghai Lu
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 18:45 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

On Mon, Jan 23, 2012 at 10:30 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> in list_for_each_safe()
>
> #define list_for_each_safe(pos, n, head) \
>        for (pos = (head)->next, n = pos->next; pos != (head); \
>                pos = n, n = pos->next)
>
> n is saved before, and safe only mean pos could be freed from the
> list, but n still can be used for next loop.
>
> in our case, the list have PF and several VFs, when
> pci_stop_bus_device() is called for PF, pos are still valid, but
> VFs are removed from the list. so n will not be valid.

That still doesn't explain anything.

The whole point of the safe version is that if the entry that is being
worked on is removed, then we can still use the next one.

Why isn't this magically true in this case? If some *other* random
entry than the one that is being iterated over can magically be
removed, then the whole thing is just pure and utter crap, and no
amount of list maintenance can ever fix it.

So explain.

> https://lkml.org/lkml/2011/10/15/141
> or from attached one.

This still doesn't make sense. Why do that stupid allocation? Why not
just move the entry? Why doesn't just the sane approach work?

Exactly why does not the obvious

    /* stop bus device for phys_fn at first */
    list_for_each_safe(l, n, &bus->devices) {
        struct pci_dev *dev = pci_dev_b(l);
        if (!dev->is_virtfn)
            pci_stop_bus_device(dev);
    }

work? What the f*^& does that pci_stop_bus_device() function do to
that bus->devices list that isn't just a "list_del()" of that single
entry? And if it does anything else, it should damn well stop doing
that.

The *exact* same loop is then apparently working for the virtual
device case, so what the hell is going on that it wouldn't work for
the physical one?

What's the confusion? Why all the crazy idiotic code that makes no sense?

                      Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 18:45       ` Linus Torvalds
@ 2012-01-23 19:34         ` Linus Torvalds
  2012-01-23 19:59           ` Yinghai Lu
  2012-01-24  4:34           ` Benjamin Herrenschmidt
  2012-01-23 19:36         ` Yinghai Lu
  1 sibling, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 19:34 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why isn't this magically true in this case? If some *other* random
> entry than the one that is being iterated over can magically be
> removed, then the whole thing is just pure and utter crap, and no
> amount of list maintenance can ever fix it.
>
> So explain.

Ahh. I finally understand what's going on. The virtual device attached
to a physical device can go away, and it's on the same damn list.

That's broken. Virtual devices set up by a physical device should be
*children* of the physical device, not "siblings". But that's
apparently not what the broken virtual PCI layer does.

So I think that there are two possible solutions:

 (a) fix the virtual devices that are attached to physical devices to
really be children of the physical device, with the physical device as
a "bridge" to that virtual bus.

I think this is the correct solution from any sane standpoint (now the
topology of the device tree actually matches the logical
relationship), which is why I think this is the RightThing(tm) to do.
And it should automatically fix this insane issue where removing a
device from a bus can remove *multiple* devices from the same list.

 (b) if that isn't an option, and the virtual devices make a mess, at
least don't make the code uglier, just do something like:

    while (!list_empty(&bus->devices)) {
        struct pci_dev *dev = list_first_entry(struct pci_dev, bus_list);

        pci_stop_bus_device(dev);
    }

which at least isn't butt ugly. Add a large comment about how
pci_stop_bus_device() can remove multiple devices due to the virtual
devices having been done badly.

Who is in charge of the whole 'is_virtfn' mess? How realistic is it to
fix that crud?

                     Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 18:45       ` Linus Torvalds
  2012-01-23 19:34         ` Linus Torvalds
@ 2012-01-23 19:36         ` Yinghai Lu
  2012-01-23 19:44           ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 19:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

On Mon, Jan 23, 2012 at 10:45 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 23, 2012 at 10:30 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>> in list_for_each_safe()
>>
>> #define list_for_each_safe(pos, n, head) \
>>        for (pos = (head)->next, n = pos->next; pos != (head); \
>>                pos = n, n = pos->next)
>>
>> n is saved before, and safe only mean pos could be freed from the
>> list, but n still can be used for next loop.
>>
>> in our case, the list have PF and several VFs, when
>> pci_stop_bus_device() is called for PF, pos are still valid, but
>> VFs are removed from the list. so n will not be valid.
>
> That still doesn't explain anything.
>
> The whole point of the safe version is that if the entry that is being
> worked on is removed, then we can still use the next one.
>
> Why isn't this magically true in this case? If some *other* random
> entry than the one that is being iterated over can magically be
> removed, then the whole thing is just pure and utter crap, and no
> amount of list maintenance can ever fix it.
>
> So explain.

Now current design with SRIOV is:
when driver for PF is loaded, it will enable sriov on PF, and VFs are
created and added to
bus->devices list.

when driver for PF is removed in pci_stop_bus_device(), it will
disable sriov on PF, and VFs are removed for the list.

>
>> https://lkml.org/lkml/2011/10/15/141
>> or from attached one.
>
> This still doesn't make sense. Why do that stupid allocation? Why not
> just move the entry? Why doesn't just the sane approach work?
>
> Exactly why does not the obvious
>
>    /* stop bus device for phys_fn at first */
>    list_for_each_safe(l, n, &bus->devices) {
>        struct pci_dev *dev = pci_dev_b(l);
>        if (!dev->is_virtfn)
>            pci_stop_bus_device(dev);
>    }
>
> work? What the f*^& does that pci_stop_bus_device() function do to
> that bus->devices list that isn't just a "list_del()" of that single
> entry? And if it does anything else, it should damn well stop doing
> that.

it does not work.  because pci_stop_bus_device for PF will remove VF,
and n is not
valid.

>
> The *exact* same loop is then apparently working for the virtual
> device case, so what the hell is going on that it wouldn't work for
> the physical one?
>
> What's the confusion? Why all the crazy idiotic code that makes no sense?

Maybe we can put VF and PF in bus->devices like:
VF come first than PF?

something like:

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index 0321fa3..3f63b55 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -116,11 +116,11 @@ static int virtfn_add(struct pci_dev *dev, int
id, int reset)
 	if (reset)
 		__pci_reset_function(virtfn);

+	virtfn->is_virtfn = 1;
 	pci_device_add(virtfn, virtfn->bus);
 	mutex_unlock(&iov->dev->sriov->lock);

 	virtfn->physfn = pci_dev_get(dev);
-	virtfn->is_virtfn = 1;

 	rc = pci_bus_add_device(virtfn);
 	if (rc)
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 7cc9e2f..e43d81c 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1218,7 +1218,10 @@ void pci_device_add(struct pci_dev *dev, struct
pci_bus *bus)
 	 * and the bus list for fixup functions, etc.
 	 */
 	down_write(&pci_bus_sem);
-	list_add_tail(&dev->bus_list, &bus->devices);
+	if (dev->is_virtfn)
+		list_add(&dev->bus_list, &bus->devices);
+	else
+		list_add_tail(&dev->bus_list, &bus->devices);
 	up_write(&pci_bus_sem);
 }

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 19:36         ` Yinghai Lu
@ 2012-01-23 19:44           ` Linus Torvalds
  2012-01-23 21:34             ` Yinghai Lu
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 19:44 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 11:36 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> Maybe we can put VF and PF in bus->devices like:
> VF come first than PF?

Ugh. Ok, so that's a disgusting hack, but it's better than messing up
the generic PCI subsystem. At least it's a disgusting hack in the IOV
code.

I still would prefer to just do the virtual devices right instead. Or
even just make the removal loop inherently robust, rather than have
that insane knowledge of virtual function devices that were done so
horribly wrong.

Or even just *keep* the virtual devices on the list even though the
physical device has been removed - make them independent of the
physical device.

Anything but that "do virtual devices utterly wrong, and then have to
work around it in the generic pci layer because it was done so badly".

                      Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 19:34         ` Linus Torvalds
@ 2012-01-23 19:59           ` Yinghai Lu
  2012-01-23 20:48             ` Yinghai Lu
  2012-01-24  4:34           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 19:59 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 11:34 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>  (b) if that isn't an option, and the virtual devices make a mess, at
> least don't make the code uglier, just do something like:
>
>    while (!list_empty(&bus->devices)) {
>        struct pci_dev *dev = list_first_entry(struct pci_dev, bus_list);
>
>        pci_stop_bus_device(dev);
>    }
>
> which at least isn't butt ugly. Add a large comment about how
> pci_stop_bus_device() can remove multiple devices due to the virtual
> devices having been done badly.

yes, this one should work and less confusing.

>
> Who is in charge of the whole 'is_virtfn' mess? How realistic is it to
> fix that crud?

Not sure.  it seems the guy Yu Zhao (?) already left intel several years ago.

It seems  ./scripts/get_maintainer.pl for pci patches now only return Jesse.

Yinghai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 19:59           ` Yinghai Lu
@ 2012-01-23 20:48             ` Yinghai Lu
  2012-01-23 22:35               ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 20:48 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 11:59 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Jan 23, 2012 at 11:34 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>>  (b) if that isn't an option, and the virtual devices make a mess, at
>> least don't make the code uglier, just do something like:
>>
>>    while (!list_empty(&bus->devices)) {
>>        struct pci_dev *dev = list_first_entry(struct pci_dev, bus_list);
>>
>>        pci_stop_bus_device(dev);
>>    }
>>
>> which at least isn't butt ugly. Add a large comment about how
>> pci_stop_bus_device() can remove multiple devices due to the virtual
>> devices having been done badly.
>
> yes, this one should work and less confusing.

---
 drivers/pci/remove.c |   11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -126,10 +126,15 @@ void pci_remove_behind_bridge(struct pci

 static void pci_stop_bus_devices(struct pci_bus *bus)
 {
-       struct list_head *l, *n;
+       /*
+        * VFs are removed by pci_remove_bus_device() in the
+        *  pci_stop_bus_devices() code path for PF.
+        *  aka, bus->devices get updated in the process.
+        */
+       while (!list_empty(&bus->devices)) {
+               struct pci_dev *dev;

-       list_for_each_safe(l, n, &bus->devices) {
-               struct pci_dev *dev = pci_dev_b(l);
+               dev = list_first_entry(&bus->devices, struct pci_dev, bus_list);
                pci_stop_bus_device(dev);
        }
 }


it does not work... get lock up

sca21d-0a86d3bf:~ # echo 0 > /sys/bus/pci/slots/3/power
[  145.027439] pciehp 0000:80:02.2:pcie04: disable_slot: physical_slot = 3
[  145.034329] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status:
SLOTCTRL a8 value read 1f9
[  145.042709] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device:
domain:bus:dev = 0000:b0:00
[  145.053636] libfcoe_device_notification: NETDEV_UNREGISTER eth6
[  145.132359] libfcoe_device_notification: NETDEV_UNREGISTER eth14
[  145.203456]   free irq_desc for 273
[  145.207012]   free irq_desc for 274
[  145.210538]   free irq_desc for 275
[  145.214094]   free irq_desc for 276
[  145.217642]   free irq_desc for 277
[  145.221188]   free irq_desc for 278
[  145.224727]   free irq_desc for 279
[  145.233569] libfcoe_device_notification: NETDEV_UNREGISTER eth15
[  145.311168]   free irq_desc for 280
[  145.314768]   free irq_desc for 281
[  145.318296]   free irq_desc for 282
[  145.321857]   free irq_desc for 283
[  145.325412]   free irq_desc for 284
[  145.328965]   free irq_desc for 285
[  145.332505]   free irq_desc for 286
[  145.337528] libfcoe_device_notification: NETDEV_UNREGISTER eth16
[  145.423045]   free irq_desc for 287
[  145.426643]   free irq_desc for 288
[  145.430164]   free irq_desc for 289
[  145.433716]   free irq_desc for 290
[  145.437257]   free irq_desc for 291
[  145.440808]   free irq_desc for 292
[  145.444355]   free irq_desc for 293
[  146.449844]   free irq_desc for 217
[  146.453458]   free irq_desc for 218
[  146.456994]   free irq_desc for 219
[  146.460566]   free irq_desc for 220
[  146.464123]   free irq_desc for 221
[  146.467675]   free irq_desc for 222
[  146.471245]   free irq_desc for 223

[  171.286565] BUG: soft lockup - CPU#2 stuck for 22s! [bash:10294]
[  171.292570] Modules linked in:
[  171.295644] irq event stamp: 107478
[  171.299132] hardirqs last  enabled at (107477):
[<ffffffff81ceaadd>] restore_args+0x0/0x30
[  171.307424] hardirqs last disabled at (107478):
[<ffffffff81cf20eb>] apic_timer_interrupt+0x6b/0x80
[  171.316495] softirqs last  enabled at (107476):
[<ffffffff8106e59c>] __do_softirq+0x195/0x1ab
[  171.325046] softirqs last disabled at (107461):
[<ffffffff81cf2adc>] call_softirq+0x1c/0x30
[  171.333417] CPU 2
[  171.335260] Modules linked in:
[  171.338518]
[  171.340023] Pid: 10294, comm: bash Not tainted
3.3.0-rc1-tip-yh-01748-g1b96060-dirty #143 Oracle Corporation Sun
Blade X6270 M3/BD,ASSY,
[  171.352330] RIP: 0010:[<ffffffff813c9a6a>]  [<ffffffff813c9a6a>]
pci_stop_bus_device+0x66/0x78
[  171.360971] RSP: 0018:ffff8820330dfce8  EFLAGS: 00000286
[  171.366285] RAX: ffff881036485000 RBX: 0000000000000002 RCX: 00000000000003bd
[  171.373408] RDX: 0000000000000000 RSI: 0000000000000002 RDI: ffff881036471000
[  171.380532] RBP: ffff8820330dfd08 R08: 0000000000000002 R09: ffff88202c330740
[  171.387657] R10: 0000000bfc77a084 R11: ffff88202c3306d0 R12: 0000000bfc77a084
[  171.394788] R13: ffff88202c3306d0 R14: ffffffff81ceaadd R15: ffffffff810af1b9
[  171.401914] FS:  00007f07c4d7a700(0000) GS:ffff88103e200000(0000)
knlGS:0000000000000000
[  171.410005] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.415745] CR2: 00007fd6b37000a0 CR3: 00000010310f1000 CR4: 00000000000407e0
[  171.422869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  171.429994] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  171.437125] Process bash (pid: 10294, threadinfo ffff8820330de000,
task ffff88202c330000)
[  171.445294] Stack:
[  171.447304]  0000000000000000 ffff88103645f000 ffff881036485000
ffff881036485028
[  171.454748]  ffff8820330dfd38 ffffffff813c9a2b 0000000000000007
ffff88103645e000
[  171.462191]  ffff881036484800 ffff881036484828 ffff8820330dfd68
ffffffff813c9a2b
[  171.469647] Call Trace:
[  171.472100]  [<ffffffff813c9a2b>] pci_stop_bus_device+0x27/0x78
[  171.478011]  [<ffffffff813c9a2b>] pci_stop_bus_device+0x27/0x78
[  171.483940]  [<ffffffff813c9bc0>] pci_remove_bus_device+0x14/0x24
[  171.490036]  [<ffffffff813dbb66>] pciehp_unconfigure_device+0x107/0x16d
[  171.496648]  [<ffffffff813db5d9>] pciehp_disable_slot+0x11e/0x186
[  171.502733]  [<ffffffff813db88c>] pciehp_sysfs_disable_slot+0x68/0x108
[  171.509251]  [<ffffffff813da9c9>] disable_slot+0x58/0x5c
[  171.514556]  [<ffffffff813d77dc>] power_write_file+0xd2/0x11c
[  171.520295]  [<ffffffff813d770a>] ? get_attention_status+0xae/0xae
[  171.526471]  [<ffffffff813d1a49>] pci_slot_attr_store+0x29/0x2b
[  171.532389]  [<ffffffff81194eec>] sysfs_write_file+0x10e/0x14a
[  171.538218]  [<ffffffff8113de25>] vfs_write+0xb5/0x128
[  171.543356]  [<ffffffff8113e081>] sys_write+0x4d/0x77
[  171.548417]  [<ffffffff81cf1652>] system_call_fastpath+0x16/0x1b
[  171.554416] Code: 89 df e8 4d 7f 00 00 48 89 df e8 1b 6a 00 00 48
8d bb 90 00 00 00 e8 78 11 0c 00 80 a3 50 09 00 00 fb 48 8b 43 10 48
83 78 38 00 <74> 08 48 89 df e8 87 a3 00 00 58 5b 41 5c 41 5d 5d c3 55
48 89
[  171.574366] Call Trace:
[  171.576809]  [<ffffffff813c9a2b>] pci_stop_bus_device+0x27/0x78
[  171.582724]  [<ffffffff813c9a2b>] pci_stop_bus_device+0x27/0x78
[  171.588634]  [<ffffffff813c9bc0>] pci_remove_bus_device+0x14/0x24
[  171.594719]  [<ffffffff813dbb66>] pciehp_unconfigure_device+0x107/0x16d
[  171.601324]  [<ffffffff813db5d9>] pciehp_disable_slot+0x11e/0x186
[  171.607409]  [<ffffffff813db88c>] pciehp_sysfs_disable_slot+0x68/0x108
[  171.613926]  [<ffffffff813da9c9>] disable_slot+0x58/0x5c
[  171.619233]  [<ffffffff813d77dc>] power_write_file+0xd2/0x11c
[  171.624971]  [<ffffffff813d770a>] ? get_attention_status+0xae/0xae
[  171.631144]  [<ffffffff813d1a49>] pci_slot_attr_store+0x29/0x2b
[  171.637064]  [<ffffffff81194eec>] sysfs_write_file+0x10e/0x14a
[  171.642890]  [<ffffffff8113de25>] vfs_write+0xb5/0x128
[  171.648024]  [<ffffffff8113e081>] sys_write+0x4d/0x77
[  171.653070]  [<ffffffff81cf1652>] system_call_fastpath+0x16/0x1b

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 19:44           ` Linus Torvalds
@ 2012-01-23 21:34             ` Yinghai Lu
  2012-01-23 22:30               ` Yinghai Lu
  0 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 21:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 11:44 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Jan 23, 2012 at 11:36 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>>
>> Maybe we can put VF and PF in bus->devices like:
>> VF come first than PF?
>
> Ugh. Ok, so that's a disgusting hack, but it's better than messing up
> the generic PCI subsystem. At least it's a disgusting hack in the IOV
> code.

just tested, the patch with this hack works.

>
> I still would prefer to just do the virtual devices right instead. Or
> even just make the removal loop inherently robust, rather than have
> that insane knowledge of virtual function devices that were done so
> horribly wrong.

ok, let me think about it more.

>
> Or even just *keep* the virtual devices on the list even though the
> physical device has been removed - make them independent of the
> physical device.

ah, not sure how this going to work.

Yinghai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 21:34             ` Yinghai Lu
@ 2012-01-23 22:30               ` Yinghai Lu
  2012-01-23 22:38                 ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-23 22:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

[-- Attachment #1: Type: text/plain, Size: 853 bytes --]

On Mon, Jan 23, 2012 at 1:34 PM, Yinghai Lu <yinghai@kernel.org> wrote:
> On Mon, Jan 23, 2012 at 11:44 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Mon, Jan 23, 2012 at 11:36 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>>>
>>> Maybe we can put VF and PF in bus->devices like:
>>> VF come first than PF?
>>
>> Ugh. Ok, so that's a disgusting hack, but it's better than messing up
>> the generic PCI subsystem. At least it's a disgusting hack in the IOV
>> code.
>
> just tested, the patch with this hack works.
>
>>
>> I still would prefer to just do the virtual devices right instead. Or
>> even just make the removal loop inherently robust, rather than have
>> that insane knowledge of virtual function devices that were done so
>> horribly wrong.

for_each_prev_safe is working...

please check attached patch...

Thanks

Yinghai

[-- Attachment #2: pci_001_debug_sriov_hot_remove_7.patch --]
[-- Type: text/x-patch, Size: 5702 bytes --]

Subject: [PATCH -v5] PCI: Make sriov work with hotplug remove

When hot remove pci express module that have pcie switch and support SRIOV, got

[ 5918.610127] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 1
[ 5918.615779] pciehp 0000:80:02.2:pcie04: Attention button interrupt received
[ 5918.622730] pciehp 0000:80:02.2:pcie04: Button pressed on Slot(3)
[ 5918.629002] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 1f9
[ 5918.637416] pciehp 0000:80:02.2:pcie04: PCI slot #3 - powering off due to button press.
[ 5918.647125] pciehp 0000:80:02.2:pcie04: pcie_isr: intr_loc 10
[ 5918.653039] pciehp 0000:80:02.2:pcie04: pciehp_green_led_blink: SLOTCTRL a8 write cmd 200
[ 5918.661229] pciehp 0000:80:02.2:pcie04: pciehp_set_attention_status: SLOTCTRL a8 write cmd c0
[ 5924.667627] pciehp 0000:80:02.2:pcie04: Disabling domain:bus:device=0000:b0:00
[ 5924.674909] pciehp 0000:80:02.2:pcie04: pciehp_get_power_status: SLOTCTRL a8 value read 2f9
[ 5924.683262] pciehp 0000:80:02.2:pcie04: pciehp_unconfigure_device: domain:bus:dev = 0000:b0:00
[ 5924.693976] libfcoe_device_notification: NETDEV_UNREGISTER eth6
[ 5924.764979] libfcoe_device_notification: NETDEV_UNREGISTER eth14
[ 5924.873539] libfcoe_device_notification: NETDEV_UNREGISTER eth15
[ 5924.995209] libfcoe_device_notification: NETDEV_UNREGISTER eth16
[ 5926.114407] sxge 0000:b2:00.0: PCI INT A disabled
[ 5926.119342] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 5926.127189] IP: [<ffffffff81353a3b>] pci_stop_bus_device+0x33/0x83
[ 5926.133377] PGD 0
[ 5926.135402] Oops: 0000 [#1] SMP
[ 5926.138659] CPU 2
[ 5926.140499] Modules linked in:
...
[ 5926.143754]
[ 5926.275823] Call Trace:
[ 5926.278267]  [<ffffffff81353a38>] pci_stop_bus_device+0x30/0x83
[ 5926.284180]  [<ffffffff81353af4>] pci_remove_bus_device+0x1a/0xba
[ 5926.290264]  [<ffffffff81366311>] pciehp_unconfigure_device+0x110/0x17b
[ 5926.296866]  [<ffffffff81365dd9>] ? pciehp_disable_slot+0x188/0x188
[ 5926.303123]  [<ffffffff81365d6f>] pciehp_disable_slot+0x11e/0x188
[ 5926.309206]  [<ffffffff81365e68>] pciehp_power_thread+0x8f/0xe0
...

 +-[0000:80]-+-00.0-[81-8f]--
 |           +-01.0-[90-9f]--
 |           +-02.0-[a0-af]--
 |           +-02.2-[b0-bf]----00.0-[b1-b3]--+-02.0-[b2]--+-00.0 Device
 |           |                               |            +-00.1 Device
 |           |                               |            +-00.2 Device
 |           |                               |            \-00.3 Device
 |           |                               \-03.0-[b3]--+-00.0 Device
 |           |                                            +-00.1 Device
 |           |                                            +-00.2 Device
 |           |                                            \-00.3 Device

root complex: 80:02.2
pci express modules: have pcie switch and are listed as b0:00.0, b1:02.0 and b1:03.0.
                end devices  are b2:00.0 and b3.00.0.
                VFs are: b2:00.1,... b2:00.3, and b3:00.1,...,b3:00.3

Root cause: when doing pci_stop_bus_device() with phys fn, it will stop virt fn and
remove the fn, so
	list_for_each_safe(l, n, &bus->devices)
will have problem to refer freed n that is pointed to vf entry.

Solution is just replacing list_for_each_safe() with list_for_each().
	pci_stop_bus_device(dev) will not remove dev from bus->devices list,
	so We only need use for_each here.

During reviewing the patch, Bjorn said:
|   The PCI hot-remove path calls pci_stop_bus_devices() via
|   pci_remove_bus_device().
|
|   pci_stop_bus_devices() traverses the bus->devices list (point A below),
|   stopping each device in turn, which calls the driver remove() method.  When
|   the device is an SR-IOV PF, the driver calls pci_disable_sriov(), which
|   also uses pci_remove_bus_device() to remove the VF devices from the
|   bus->devices list (point B).
|
|       pci_remove_bus_device
|         pci_stop_bus_device
|           pci_stop_bus_devices(subordinate)
|             list_for_each(bus->devices)             <-- A
|               pci_stop_bus_device(PF)
|                 ...
|                   driver->remove
|                     pci_disable_sriov
|                       ...
|                         pci_remove_bus_device(VF)
|                             <remove from bus_list>  <-- B
|
|   At B, we're changing the same list we're iterating through at A, so when
|   the driver remove() method returns, the pci_stop_bus_devices() iterator has
|   a pointer to a list entry that has already been freed.

Discussion thread can be found : https://lkml.org/lkml/2011/10/15/141

-v5: According to Linus to make remove more robust, Change to list_for_each_prev_safe
	instead. That is more reasonable, because those devices are added to tail of
	the list before.

Signed-off-by: Yinghai Lu <yinghai@kernel.org>

---
 drivers/pci/remove.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

Index: linux-2.6/drivers/pci/remove.c
===================================================================
--- linux-2.6.orig/drivers/pci/remove.c
+++ linux-2.6/drivers/pci/remove.c
@@ -128,7 +128,15 @@ static void pci_stop_bus_devices(struct
 {
 	struct list_head *l, *n;
 
-	list_for_each_safe(l, n, &bus->devices) {
+	/*
+	 * VFs could be removed by pci_remove_bus_device() in the
+	 *  pci_stop_bus_devices() code path for PF.
+	 *  aka, bus->devices get updated in the process.
+	 * but VFs are inserted after PFs when SRIOV is enabled for PF,
+	 * We can iterate the list backwards to get prev valid PF instead
+	 *  of removed VF.
+	 */
+	list_for_each_prev_safe(l, n, &bus->devices) {
 		struct pci_dev *dev = pci_dev_b(l);
 		pci_stop_bus_device(dev);
 	}

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 20:48             ` Yinghai Lu
@ 2012-01-23 22:35               ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 22:35 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 12:48 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> it does not work... get lock up

Hmm. That's consistent with pci_stop_bus_device() just not removing
the device at all from the bus list. Which on the face of it would be
sane, since it's "stop", not "remove".

There's definitely something odd going on about that whole thing - not
just the confusion about "stop" vs "remove". It has that "is_added"
logic too. I have no idea what the logic of that whole thing is, it
really doesn't look sane.

                        Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 22:30               ` Yinghai Lu
@ 2012-01-23 22:38                 ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2012-01-23 22:38 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci,
	linux-kernel, Konrad Rzeszutek Wilk

On Mon, Jan 23, 2012 at 2:30 PM, Yinghai Lu <yinghai@kernel.org> wrote:
>
> for_each_prev_safe is working...
>
> please check attached patch...

Ugh. This is probably the best one so far, but the whole thing is
really confused.

Why does "stop_bus_device" do a "remove" for VF's at _all_? Why is
everything about those stupid virtual devices so screwed up? Why do we
have separate "stop" and "remove" methods, and then the "stop" does a
remove? WTF?

                         Linus

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/7] PCI: Make sriov work with hotplug remove
  2012-01-23 19:34         ` Linus Torvalds
  2012-01-23 19:59           ` Yinghai Lu
@ 2012-01-24  4:34           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2012-01-24  4:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Yinghai Lu, Jesse Barnes, Kenji Kaneshige, Matthew Wilcox,
	linux-pci, linux-kernel, Konrad Rzeszutek Wilk

On Mon, 2012-01-23 at 11:34 -0800, Linus Torvalds wrote:
> On Mon, Jan 23, 2012 at 10:45 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why isn't this magically true in this case? If some *other* random
> > entry than the one that is being iterated over can magically be
> > removed, then the whole thing is just pure and utter crap, and no
> > amount of list maintenance can ever fix it.
> >
> > So explain.
> 
> Ahh. I finally understand what's going on. The virtual device attached
> to a physical device can go away, and it's on the same damn list.
> 
> That's broken. Virtual devices set up by a physical device should be
> *children* of the physical device, not "siblings". But that's
> apparently not what the broken virtual PCI layer does.

Thank the PCI SIG for that ... they are sibling functions (or even
devices in some case) of the PF :-(

> So I think that there are two possible solutions:
> 
>  (a) fix the virtual devices that are attached to physical devices to
> really be children of the physical device, with the physical device as
> a "bridge" to that virtual bus.

This will confuse various other aspects of the PCI code since they are
really siblings from an addressing standpoint (ie bus/dev/fn)

Cheers,
Ben.

> I think this is the correct solution from any sane standpoint (now the
> topology of the device tree actually matches the logical
> relationship), which is why I think this is the RightThing(tm) to do.
> And it should automatically fix this insane issue where removing a
> device from a bus can remove *multiple* devices from the same list.
> 
>  (b) if that isn't an option, and the virtual devices make a mess, at
> least don't make the code uglier, just do something like:
> 
>     while (!list_empty(&bus->devices)) {
>         struct pci_dev *dev = list_first_entry(struct pci_dev, bus_list);
> 
>         pci_stop_bus_device(dev);
>     }
> 
> which at least isn't butt ugly. Add a large comment about how
> pci_stop_bus_device() can remove multiple devices due to the virtual
> devices having been done badly.
> 
> Who is in charge of the whole 'is_virtfn' mess? How realistic is it to
> fix that crud?
> 
>                      Linus
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pci" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 6/7] pciehp: Add Disable/enable link functions
  2012-01-23 16:13   ` Linus Torvalds
@ 2012-01-24  5:36     ` Yinghai Lu
  0 siblings, 0 replies; 27+ messages in thread
From: Yinghai Lu @ 2012-01-24  5:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jesse Barnes, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

On Mon, Jan 23, 2012 at 8:13 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
>> +
>> +       retval = pciehp_readw(ctrl, PCI_EXP_LNKCTL, &lnk_ctrl);
>> +       if (retval) {
>> +               ctrl_err(ctrl, "Cannot read LNKCTRL register\n");
>> +               return retval;
>> +       }
>
> Is there really any point at all in checking the return value of that
> readw() function?
>
> Nobody actually ever sets it. We should stop even bothering to pretend
> to care. There is no return value in real life.
>
> If the device doesn't exist or the read fails, you will never ever get
> an error *anyway*. You'll likely get 0xffff, and possibly a machine
> check exception. And that is unlikely to ever change - even if the
> hardware were to do it, it's insane to do error checking on an
> individual IO level. It causes more problems than it could possibly
> ever solve - error checking has to be done at a different level than
> on some random individual access.
>
> We should make the return type of pci_read_config_word() be 'void',
> instead of having code like this that looks like it is careful, but in
> actual fact in reality is just totally pointless.

looks like

#define PCI_OP_READ(size,type,len) \
int pci_bus_read_config_##size \
        (struct pci_bus *bus, unsigned int devfn, int pos, type *value) \
{                                                                       \
        int res;                                                        \
        unsigned long flags;                                            \
        u32 data = 0;                                                   \
        if (PCI_##size##_BAD) return PCIBIOS_BAD_REGISTER_NUMBER;       \
        raw_spin_lock_irqsave(&pci_lock, flags);                        \
        res = bus->ops->read(bus, devfn, pos, len, &data);              \
        *value = (type)data;                                            \
        raw_spin_unlock_irqrestore(&pci_lock, flags);           \
        return res;                                                     \
}

will check register number. like we can not pass 0x01 but want to read
word and dword.

also bus ops read/write like pci_mmcfg_read/write will double check if
passed bus/dev/fn/reg
ranges.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] PCI: pcie hotplug related patch
  2012-01-21 10:26 ` [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
@ 2012-01-27 18:36   ` Jesse Barnes
  2012-01-27 18:58     ` Yinghai Lu
  0 siblings, 1 reply; 27+ messages in thread
From: Jesse Barnes @ 2012-01-27 18:36 UTC (permalink / raw)
  To: Yinghai Lu
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 1499 bytes --]

On Sat, 21 Jan 2012 02:26:46 -0800
Yinghai Lu <yinghai@kernel.org> wrote:

> On Sat, Jan 21, 2012 at 1:52 AM, Yinghai Lu <yinghai@kernel.org> wrote:
> > b40c9ef: pciehp: Disable/enable link during slot power off/on
> > fe877ae: pciehp: Add Disable/enable link functions
> > 292e286: pciehp: Add pcie_wait_link_not_active()
> > 186b3fd: pciehp: print out link status when dlla get active.
> > 2781df4: pciehp: Checking pci conf reading to new added device instead of sleep 1s
> > bf1c4ac: PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device
> > 601c6f1: PCI: Make sriov work with hotplug remove
> >
> >  drivers/pci/hotplug/pciehp_hpc.c |  133 +++++++++++++++++++++++++++++++-------
> >  drivers/pci/pci.h                |    2 +
> >  drivers/pci/probe.c              |   48 +++++++++-----
> >  drivers/pci/remove.c             |   28 ++++++++-
> >  4 files changed, 169 insertions(+), 42 deletions(-)
> >
> > First one is for hotplug with sriov under bridge.
> > following two are for hotplug probing with pci conf reading.
> > Last four are for pcie link disable when power off slots.
> 
> could be found at:
> 
> 	git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
> for-pci-hp

Ok I dropped the earlier link state check; can you re-post the series
with Linus's comments addressed?

Kenji-san, do you have time to at least review these?  Testing would be
nice too...

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] PCI: pcie hotplug related patch
  2012-01-27 18:36   ` Jesse Barnes
@ 2012-01-27 18:58     ` Yinghai Lu
  2012-01-30  3:42       ` Kenji Kaneshige
  0 siblings, 1 reply; 27+ messages in thread
From: Yinghai Lu @ 2012-01-27 18:58 UTC (permalink / raw)
  To: Jesse Barnes
  Cc: Linus Torvalds, Kenji Kaneshige, Matthew Wilcox, linux-pci, linux-kernel

On Fri, Jan 27, 2012 at 10:36 AM, Jesse Barnes <jbarnes@virtuousgeek.org> wrote:

>>
>>       git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>> for-pci-hp
>
> Ok I dropped the earlier link state check; can you re-post the series
> with Linus's comments addressed?

resent.

first one patch should get Linus's Ack, but he just said that solution
is the best so far.

>
> Kenji-san, do you have time to at least review these?  Testing would be
> nice too...

others should already address all Kenji's reviews.

Thanks

Yinghai

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/7] PCI: pcie hotplug related patch
  2012-01-27 18:58     ` Yinghai Lu
@ 2012-01-30  3:42       ` Kenji Kaneshige
  0 siblings, 0 replies; 27+ messages in thread
From: Kenji Kaneshige @ 2012-01-30  3:42 UTC (permalink / raw)
  To: Yinghai Lu, Jesse Barnes
  Cc: Linus Torvalds, Matthew Wilcox, linux-pci, linux-kernel

(2012/01/28 3:58), Yinghai Lu wrote:
> On Fri, Jan 27, 2012 at 10:36 AM, Jesse Barnes<jbarnes@virtuousgeek.org>  wrote:
>
>>>
>>>        git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
>>> for-pci-hp
>>
>> Ok I dropped the earlier link state check; can you re-post the series
>> with Linus's comments addressed?
>
> resent.
>
> first one patch should get Linus's Ack, but he just said that solution
> is the best so far.
>
>>
>> Kenji-san, do you have time to at least review these?  Testing would be
>> nice too...
>
> others should already address all Kenji's reviews.
>

I have a chance to test the patches in this week. I will do it and report
the result. For some patches, I don't have any environment to reproduce
the problem the patch wants to solve. So it is difficult for me to confirm
the patch fix the problem. But I will test them from the regression
point of view.

Regards,
Kenji Kaneshige

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2012-01-30  3:42 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2012-01-21  9:52 [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
2012-01-21  9:52 ` [PATCH 1/7] PCI: Make sriov work with hotplug remove Yinghai Lu
2012-01-23 16:06   ` Linus Torvalds
2012-01-23 18:30     ` Yinghai Lu
2012-01-23 18:45       ` Linus Torvalds
2012-01-23 19:34         ` Linus Torvalds
2012-01-23 19:59           ` Yinghai Lu
2012-01-23 20:48             ` Yinghai Lu
2012-01-23 22:35               ` Linus Torvalds
2012-01-24  4:34           ` Benjamin Herrenschmidt
2012-01-23 19:36         ` Yinghai Lu
2012-01-23 19:44           ` Linus Torvalds
2012-01-23 21:34             ` Yinghai Lu
2012-01-23 22:30               ` Yinghai Lu
2012-01-23 22:38                 ` Linus Torvalds
2012-01-21  9:52 ` [PATCH 2/7] PCI: Separate pci_bus_read_dev_vendor_id from pci_scan_device Yinghai Lu
2012-01-21  9:52 ` [PATCH 3/7] pciehp: Checking pci conf reading to new added device instead of sleep 1s Yinghai Lu
2012-01-21  9:52 ` [PATCH 4/7] pciehp: print out link status when dlla get active Yinghai Lu
2012-01-21  9:52 ` [PATCH 5/7] pciehp: Add pcie_wait_link_not_active() Yinghai Lu
2012-01-21  9:52 ` [PATCH 6/7] pciehp: Add Disable/enable link functions Yinghai Lu
2012-01-23 16:13   ` Linus Torvalds
2012-01-24  5:36     ` Yinghai Lu
2012-01-21  9:52 ` [PATCH 7/7] pciehp: Disable/enable link during slot power off/on Yinghai Lu
2012-01-21 10:26 ` [PATCH 0/7] PCI: pcie hotplug related patch Yinghai Lu
2012-01-27 18:36   ` Jesse Barnes
2012-01-27 18:58     ` Yinghai Lu
2012-01-30  3:42       ` Kenji Kaneshige

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).