* [RFC/PATCH 0/16] Ops based MSI Implementation
@ 2007-01-25  8:34 Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi() Michael Ellerman
                   ` (19 more replies)
  0 siblings, 20 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

OK, here's a first cut at moving ops based MSI into the generic code. I'm
posting this now to make sure I'm not heading off into the weeds.

The fifth patch contains the guts of it; I've included the MPIC and
RTAS backends as examples. In fact they actually work.

In order to smoothly merge this with the old MSI code, the two will need to
coexist in the tree for at least a few commits, so I've added (invisible)
Kconfig symbols to allow that.

I plan to merge the Intel code by:
 * copying it into drivers/pci/msi/intel.c with zero changes.
 * providing a minimal shim to connect the ops code to the intel code.
 * at this point the code should be functional but ugly as hell.
 * via a longish series of patches, adapt the intel code to better match
   the new ops code.
 * this should allow us to bisect through to find any mistakes.

If people think that's crazy and/or stupid, please let me know :)

TBD are:
 * suspend / resume hooks in the ops - this shouldn't be too tricky with
   the power management API cleaned up a touch.
 * working out why the hell msi_remove_pci_irq_vectors() is a special case?

cheers


* [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi()
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 22:33   ` patch msi-replace-pci_msi_quirk-with-calls-to-pci_no_msi.patch added to gregkh-2.6 tree gregkh
  2007-01-25  8:34 ` [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state Michael Ellerman
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

I don't see any reason why we need pci_msi_quirk; quirk code can just
call pci_no_msi() instead.

Remove the check of pci_msi_quirk in msi_init(). This is safe as all
calls to msi_init() are protected by calls to pci_msi_supported(),
which checks pci_msi_enable, which is disabled by pci_no_msi().
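
The argument above can be illustrated with a small userspace model. This is
only a sketch, not kernel code: the names mirror the kernel's, but the bodies,
the msi_init_calls counter and model_enable_msi() are hypothetical stand-ins
invented for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace model: names mirror the kernel's, bodies do not. */
static bool pci_msi_enable = true;	/* global MSI switch */
static int msi_init_calls;		/* counts how often msi_init() is reached */

/* What a quirk calls after this patch, instead of setting pci_msi_quirk. */
void pci_no_msi(void)
{
	pci_msi_enable = false;
}

/* Stand-in for the guard every msi_init() caller goes through first. */
int pci_msi_supported(void)
{
	return pci_msi_enable ? 0 : -EINVAL;
}

/* Stand-in for msi_init(), now without its own quirk check. */
int msi_init(void)
{
	msi_init_calls++;
	return 0;
}

/* Shape of a caller: guard first, then init. */
int model_enable_msi(void)
{
	int rc = pci_msi_supported();

	if (rc)
		return rc;
	return msi_init();
}
```

In this model, once a quirk has called pci_no_msi() the guard fails and
msi_init() is never reached, so a second check inside msi_init() would be
redundant.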

The pci_disable_msi routines didn't check pci_msi_quirk, only
pci_msi_enable, but as far as I can see that was a bug not a feature.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/net/bnx2.c   |    3 +--
 drivers/pci/msi.c    |    7 -------
 drivers/pci/pci.h    |    6 +-----
 drivers/pci/quirks.c |    7 ++-----
 4 files changed, 4 insertions(+), 19 deletions(-)

Index: msi/drivers/net/bnx2.c
===================================================================
--- msi.orig/drivers/net/bnx2.c
+++ msi/drivers/net/bnx2.c
@@ -5942,8 +5942,7 @@ bnx2_init_board(struct pci_dev *pdev, st
 	 * responding after a while.
 	 *
 	 * AMD believes this incompatibility is unique to the 5706, and
-	 * prefers to locally disable MSI rather than globally disabling it
-	 * using pci_msi_quirk.
+	 * prefers to locally disable MSI rather than globally disabling it.
 	 */
 	if (CHIP_NUM(bp) == CHIP_NUM_5706 && disable_msi == 0) {
 		struct pci_dev *amd_8132 = NULL;
Index: msi/drivers/pci/msi.c
===================================================================
--- msi.orig/drivers/pci/msi.c
+++ msi/drivers/pci/msi.c
@@ -169,13 +169,6 @@ static int msi_init(void)
 	if (!status)
 		return status;
 
-	if (pci_msi_quirk) {
-		pci_msi_enable = 0;
-		printk(KERN_WARNING "PCI: MSI quirk detected. MSI disabled.\n");
-		status = -EINVAL;
-		return status;
-	}
-
 	status = msi_cache_init();
 	if (status < 0) {
 		pci_msi_enable = 0;
Index: msi/drivers/pci/pci.h
===================================================================
--- msi.orig/drivers/pci/pci.h
+++ msi/drivers/pci/pci.h
@@ -43,12 +43,8 @@ extern void pci_remove_legacy_files(stru
 /* Lock for read/write access to pci device and bus lists */
 extern struct rw_semaphore pci_bus_sem;
 
-#ifdef CONFIG_PCI_MSI
-extern int pci_msi_quirk;
-#else
-#define pci_msi_quirk 0
-#endif
 extern unsigned int pci_pm_d3_delay;
+
 #ifdef CONFIG_PCI_MSI
 void disable_msi_mode(struct pci_dev *dev, int pos, int type);
 void pci_no_msi(void);
Index: msi/drivers/pci/quirks.c
===================================================================
--- msi.orig/drivers/pci/quirks.c
+++ msi/drivers/pci/quirks.c
@@ -1682,9 +1682,6 @@ DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_N
 			quirk_nvidia_ck804_pcie_aer_ext_cap);
 
 #ifdef CONFIG_PCI_MSI
-/* To disable MSI globally */
-int pci_msi_quirk;
-
 /* The Serverworks PCI-X chipset does not support MSI. We cannot easily rely
  * on setting PCI_BUS_FLAGS_NO_MSI in its bus flags because there are actually
  * some other busses controlled by the chipset even if Linux is not aware of it.
@@ -1693,8 +1690,8 @@ int pci_msi_quirk;
  */
 static void __init quirk_svw_msi(struct pci_dev *dev)
 {
-	pci_msi_quirk = 1;
-	printk(KERN_WARNING "PCI: MSI quirk detected. pci_msi_quirk set.\n");
+	pci_no_msi();
+	printk(KERN_WARNING "PCI: MSI quirk detected. MSI deactivated.\n");
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, PCI_DEVICE_ID_SERVERWORKS_GCNB_LE, quirk_svw_msi);
 


* [RFC/PATCH 2/16] Remove pci_scan_msi_device()
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi() Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 22:33   ` patch msi-remove-pci_scan_msi_device.patch added to gregkh-2.6 tree gregkh
  2007-01-25  8:34 ` [RFC/PATCH 5/16] Ops based MSI implementation Michael Ellerman
                   ` (16 subsequent siblings)
  19 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

pci_scan_msi_device() doesn't do anything anymore, so remove it.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/kernel/pci_64.c |    2 --
 drivers/pci/msi.c            |    6 ------
 drivers/pci/probe.c          |    1 -
 include/linux/pci.h          |    2 --
 4 files changed, 11 deletions(-)

Index: msi/arch/powerpc/kernel/pci_64.c
===================================================================
--- msi.orig/arch/powerpc/kernel/pci_64.c
+++ msi/arch/powerpc/kernel/pci_64.c
@@ -381,8 +381,6 @@ struct pci_dev *of_create_pci_dev(struct
 
 	pci_device_add(dev, bus);
 
-	/* XXX pci_scan_msi_device(dev); */
-
 	return dev;
 }
 EXPORT_SYMBOL(of_create_pci_dev);
Index: msi/drivers/pci/msi.c
===================================================================
--- msi.orig/drivers/pci/msi.c
+++ msi/drivers/pci/msi.c
@@ -293,12 +293,6 @@ static int msi_lookup_irq(struct pci_dev
 	return -EACCES;
 }
 
-void pci_scan_msi_device(struct pci_dev *dev)
-{
-	if (!dev)
-		return;
-}
-
 #ifdef CONFIG_PM
 int pci_save_msi_state(struct pci_dev *dev)
 {
Index: msi/drivers/pci/probe.c
===================================================================
--- msi.orig/drivers/pci/probe.c
+++ msi/drivers/pci/probe.c
@@ -902,7 +902,6 @@ pci_scan_single_device(struct pci_bus *b
 		return NULL;
 
 	pci_device_add(dev, bus);
-	pci_scan_msi_device(dev);
 
 	return dev;
 }
Index: msi/include/linux/pci.h
===================================================================
--- msi.orig/include/linux/pci.h
+++ msi/include/linux/pci.h
@@ -622,7 +622,6 @@ struct msix_entry {
 
 
 #ifndef CONFIG_PCI_MSI
-static inline void pci_scan_msi_device(struct pci_dev *dev) {}
 static inline int pci_enable_msi(struct pci_dev *dev) {return -1;}
 static inline void pci_disable_msi(struct pci_dev *dev) {}
 static inline int pci_enable_msix(struct pci_dev* dev,
@@ -630,7 +629,6 @@ static inline int pci_enable_msix(struct
 static inline void pci_disable_msix(struct pci_dev *dev) {}
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
 #else
-extern void pci_scan_msi_device(struct pci_dev *dev);
 extern int pci_enable_msi(struct pci_dev *dev);
 extern void pci_disable_msi(struct pci_dev *dev);
 extern int pci_enable_msix(struct pci_dev* dev,


* [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi() Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 22:33   ` patch msi-combine-pci__msi-msix_state.patch added to gregkh-2.6 tree gregkh
  2007-01-25  8:34 ` [RFC/PATCH 2/16] Remove pci_scan_msi_device() Michael Ellerman
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

The PCI save/restore code doesn't need to care about MSI vs MSI-X; all
it really wants is to say "save/restore all MSI(-X) info for this device".

This is borne out in the code: we call the MSI and MSI-X save routines
side by side, and similarly with the restore routines.

So combine the MSI/MSI-X routines into pci_save_msi_state() and
pci_restore_msi_state(). It is up to those routines to decide what state
needs to be saved.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi.c |   27 +++++++++++++++++++++++----
 drivers/pci/pci.c |    4 +---
 drivers/pci/pci.h |    6 ++----
 3 files changed, 26 insertions(+), 11 deletions(-)

Index: msi/drivers/pci/msi.c
===================================================================
--- msi.orig/drivers/pci/msi.c
+++ msi/drivers/pci/msi.c
@@ -294,7 +294,7 @@ static int msi_lookup_irq(struct pci_dev
 }
 
 #ifdef CONFIG_PM
-int pci_save_msi_state(struct pci_dev *dev)
+static int __pci_save_msi_state(struct pci_dev *dev)
 {
 	int pos, i = 0;
 	u16 control;
@@ -332,7 +332,7 @@ int pci_save_msi_state(struct pci_dev *d
 	return 0;
 }
 
-void pci_restore_msi_state(struct pci_dev *dev)
+static void __pci_restore_msi_state(struct pci_dev *dev)
 {
 	int i = 0, pos;
 	u16 control;
@@ -360,7 +360,7 @@ void pci_restore_msi_state(struct pci_de
 	kfree(save_state);
 }
 
-int pci_save_msix_state(struct pci_dev *dev)
+static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
 	int temp;
@@ -408,7 +408,20 @@ int pci_save_msix_state(struct pci_dev *
 	return 0;
 }
 
-void pci_restore_msix_state(struct pci_dev *dev)
+int pci_save_msi_state(struct pci_dev *dev)
+{
+	int rc;
+
+	rc = __pci_save_msi_state(dev);
+	if (rc)
+		return rc;
+
+	rc = __pci_save_msix_state(dev);
+
+	return rc;
+}
+
+static void __pci_restore_msix_state(struct pci_dev *dev)
 {
 	u16 save;
 	int pos;
@@ -445,6 +458,12 @@ void pci_restore_msix_state(struct pci_d
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 }
+
+void pci_restore_msi_state(struct pci_dev *dev)
+{
+	__pci_restore_msi_state(dev);
+	__pci_restore_msix_state(dev);
+}
 #endif
 
 /**
Index: msi/drivers/pci/pci.c
===================================================================
--- msi.orig/drivers/pci/pci.c
+++ msi/drivers/pci/pci.c
@@ -632,8 +632,6 @@ pci_save_state(struct pci_dev *dev)
 		pci_read_config_dword(dev, i * 4,&dev->saved_config_space[i]);
 	if ((i = pci_save_msi_state(dev)) != 0)
 		return i;
-	if ((i = pci_save_msix_state(dev)) != 0)
-		return i;
 	if ((i = pci_save_pcie_state(dev)) != 0)
 		return i;
 	if ((i = pci_save_pcix_state(dev)) != 0)
@@ -671,7 +669,7 @@ pci_restore_state(struct pci_dev *dev)
 	}
 	pci_restore_pcix_state(dev);
 	pci_restore_msi_state(dev);
-	pci_restore_msix_state(dev);
+
 	return 0;
 }
 
Index: msi/drivers/pci/pci.h
===================================================================
--- msi.orig/drivers/pci/pci.h
+++ msi/drivers/pci/pci.h
@@ -52,17 +52,15 @@ void pci_no_msi(void);
 static inline void disable_msi_mode(struct pci_dev *dev, int pos, int type) { }
 static inline void pci_no_msi(void) { }
 #endif
+
 #if defined(CONFIG_PCI_MSI) && defined(CONFIG_PM)
 int pci_save_msi_state(struct pci_dev *dev);
-int pci_save_msix_state(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
-void pci_restore_msix_state(struct pci_dev *dev);
 #else
 static inline int pci_save_msi_state(struct pci_dev *dev) { return 0; }
-static inline int pci_save_msix_state(struct pci_dev *dev) { return 0; }
 static inline void pci_restore_msi_state(struct pci_dev *dev) {}
-static inline void pci_restore_msix_state(struct pci_dev *dev) {}
 #endif
+
 static inline int pci_no_d1d2(struct pci_dev *dev)
 {
 	unsigned int parent_dstates = 0;


* [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (3 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 5/16] Ops based MSI implementation Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 22:33   ` patch msi-abstract-msi-suspend.patch added to gregkh-2.6 tree gregkh
  2007-01-28  8:27   ` [RFC/PATCH 4/16] Abstract MSI suspend Eric W. Biederman
  2007-01-25  8:34 ` [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines Michael Ellerman
                   ` (14 subsequent siblings)
  19 siblings, 2 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Currently pci_disable_device() disables MSI on a device by twiddling
bits in config space via disable_msi_mode().

On some platforms that may not be appropriate, so abstract the MSI
suspend logic into pci_disable_device_msi().

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi.c |   11 +++++++++++
 drivers/pci/pci.c |    7 +------
 drivers/pci/pci.h |    2 ++
 3 files changed, 14 insertions(+), 6 deletions(-)

Index: msi/drivers/pci/msi.c
===================================================================
--- msi.orig/drivers/pci/msi.c
+++ msi/drivers/pci/msi.c
@@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
 	pci_intx(dev, 1);  /* enable intx */
 }
 
+void pci_disable_device_msi(struct pci_dev *dev)
+{
+	if (dev->msi_enabled)
+		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
+			PCI_CAP_ID_MSI);
+
+	if (dev->msix_enabled)
+		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSIX),
+			PCI_CAP_ID_MSIX);
+}
+
 static int msi_lookup_irq(struct pci_dev *dev, int type)
 {
 	int irq;
Index: msi/drivers/pci/pci.c
===================================================================
--- msi.orig/drivers/pci/pci.c
+++ msi/drivers/pci/pci.c
@@ -770,12 +770,7 @@ pci_disable_device(struct pci_dev *dev)
 	if (atomic_sub_return(1, &dev->enable_cnt) != 0)
 		return;
 
-	if (dev->msi_enabled)
-		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
-			PCI_CAP_ID_MSI);
-	if (dev->msix_enabled)
-		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
-			PCI_CAP_ID_MSIX);
+	pci_disable_device_msi(dev);
 
 	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
 	if (pci_command & PCI_COMMAND_MASTER) {
Index: msi/drivers/pci/pci.h
===================================================================
--- msi.orig/drivers/pci/pci.h
+++ msi/drivers/pci/pci.h
@@ -47,9 +47,11 @@ extern unsigned int pci_pm_d3_delay;
 
 #ifdef CONFIG_PCI_MSI
 void disable_msi_mode(struct pci_dev *dev, int pos, int type);
+extern void pci_disable_device_msi(struct pci_dev *dev);
 void pci_no_msi(void);
 #else
 static inline void disable_msi_mode(struct pci_dev *dev, int pos, int type) { }
+static inline void pci_disable_device_msi(struct pci_dev *dev) { }
 static inline void pci_no_msi(void) { }
 #endif
 


* [RFC/PATCH 5/16] Ops based MSI implementation
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (2 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 2/16] Remove pci_scan_msi_device() Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 21:52   ` Greg KH
  2007-01-25  8:34 ` [RFC/PATCH 4/16] Abstract MSI suspend Michael Ellerman
                   ` (15 subsequent siblings)
  19 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

This is the guts of our ops based MSI implementation. We need to use the
ops approach to accommodate RTAS, where firmware handles all MSI
configuration, and also so we can build a single kernel which boots on
multiple hardware configurations.

So that we don't have to replace the existing code in a single patch, we
add PCI_MSI_NEW and PCI_MSI_OLD Kconfig symbols. These will vanish once
all platforms are using the new code.
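
As a rough illustration of the ops approach, here is a userspace sketch of the
check/alloc/enable sequence with unwinding on failure. It is not the code from
this patch: struct model_dev, struct model_entry, struct model_msi_ops, the
toy_* backend and model_msi_enable() are all hypothetical names, and the real
callbacks also take a type argument that is left out here for brevity.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and struct msix_entry. */
struct model_dev { int irqs_allocated; int enabled; };
struct model_entry { unsigned int vector; unsigned int entry; };

/* Simplified ops table: enable is optional, as in the patch. */
struct model_msi_ops {
	int (*check)(struct model_dev *dev, int num, struct model_entry *e);
	int (*alloc)(struct model_dev *dev, int num, struct model_entry *e);
	int (*enable)(struct model_dev *dev, int num, struct model_entry *e);
	void (*free)(struct model_dev *dev, int num, struct model_entry *e);
};

/* Toy backend: supports exactly one vector. */
int toy_check(struct model_dev *dev, int num, struct model_entry *e)
{
	(void)dev; (void)e;
	return num == 1 ? 0 : -EINVAL;
}

int toy_alloc(struct model_dev *dev, int num, struct model_entry *e)
{
	for (int i = 0; i < num; i++)
		e[i].vector = 100 + i;	/* made-up irq numbers */
	dev->irqs_allocated = num;
	return 0;
}

int toy_enable_fail(struct model_dev *dev, int num, struct model_entry *e)
{
	(void)dev; (void)num; (void)e;
	return -EIO;			/* simulate a backend that cannot enable */
}

void toy_free(struct model_dev *dev, int num, struct model_entry *e)
{
	(void)num; (void)e;
	dev->irqs_allocated = 0;
}

/* Generic driver of the sequence: check, alloc, optional enable. */
int model_msi_enable(struct model_dev *dev, const struct model_msi_ops *ops,
		     int num, struct model_entry *e)
{
	int rc;

	if (!dev || !ops || !e || num <= 0)
		return -EINVAL;

	rc = ops->check(dev, num, e);
	if (rc)
		return rc;

	rc = ops->alloc(dev, num, e);
	if (rc)
		return rc;

	if (ops->enable) {
		rc = ops->enable(dev, num, e);
		if (rc) {
			ops->free(dev, num, e);	/* undo alloc on failure */
			return rc;
		}
	}

	dev->enabled = 1;
	return 0;
}
```

The point of the sketch is that the generic code only sequences the callbacks
and unwinds on error; everything platform specific lives behind the ops table,
which is what would let a firmware-driven backend such as RTAS and a
config-space backend share one caller.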

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/Kconfig      |   21 ++++
 drivers/pci/Makefile     |    3 
 drivers/pci/msi/Makefile |    9 +
 drivers/pci/msi/core.c   |  224 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/msi-ops.h  |  168 +++++++++++++++++++++++++++++++++++
 include/linux/pci.h      |    5 +
 6 files changed, 429 insertions(+), 1 deletion(-)

Index: msi/drivers/pci/Kconfig
===================================================================
--- msi.orig/drivers/pci/Kconfig
+++ msi/drivers/pci/Kconfig
@@ -17,6 +17,27 @@ config PCI_MSI
 
 	   If you don't know what to do here, say N.
 
+config PCI_MSI_NEW
+	bool
+	depends on PCI_MSI && !PCI_MSI_OLD
+
+config PCI_MSI_OLD
+	bool
+	depends on PCI_MSI && !PCI_MSI_NEW
+	default y
+
+config PCI_MSI_DEBUG
+	bool "PCI MSI Debugging"
+	depends on PCI_MSI_NEW && DEBUG_KERNEL
+	default y
+	help
+	  Say Y here if you want the PCI MSI code to produce a bunch of
+	  debug messages. This is probably only useful if you're working
+	  on MSI support for your platform, or debugging a driver that
+	  uses MSI.
+
+	  If in doubt, say N.
+
 config PCI_MULTITHREAD_PROBE
 	bool "PCI Multi-threaded probe (EXPERIMENTAL)"
 	depends on PCI && EXPERIMENTAL && BROKEN
Index: msi/drivers/pci/Makefile
===================================================================
--- msi.orig/drivers/pci/Makefile
+++ msi/drivers/pci/Makefile
@@ -15,7 +15,8 @@ obj-$(CONFIG_HOTPLUG) += hotplug.o
 obj-$(CONFIG_HOTPLUG_PCI) += hotplug/
 
 # Build the PCI MSI interrupt support
-obj-$(CONFIG_PCI_MSI) += msi.o
+obj-$(CONFIG_PCI_MSI_OLD) += msi.o
+obj-$(CONFIG_PCI_MSI_NEW) += msi/
 
 # Build the Hypertransport interrupt support
 obj-$(CONFIG_HT_IRQ) += htirq.o
Index: msi/drivers/pci/msi/Makefile
===================================================================
--- /dev/null
+++ msi/drivers/pci/msi/Makefile
@@ -0,0 +1,9 @@
+#
+# Makefile for the PCI MSI support
+#
+
+obj-y			+= core.o
+
+ifeq ($(CONFIG_PCI_MSI_DEBUG),y)
+EXTRA_CFLAGS += -DDEBUG
+endif
Index: msi/drivers/pci/msi/core.c
===================================================================
--- /dev/null
+++ msi/drivers/pci/msi/core.c
@@ -0,0 +1,224 @@
+/*
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/msi.h>
+#include <linux/msi-ops.h>
+#include <linux/pci.h>
+#include <linux/slab.h>
+#include <linux/irq.h>
+#include <asm/msi.h>
+
+static int no_msi;
+
+void pci_no_msi(void)
+{
+	printk(KERN_DEBUG "PCI MSI disabled on command line.\n");
+	no_msi = 1;
+}
+
+
+/* msi_info helpers */
+
+static int alloc_msi_info(struct pci_dev *pdev, int num,
+			  struct msix_entry *entries, int type)
+{
+	struct msi_info *info;
+	unsigned int entries_size;
+
+	entries_size = sizeof(struct msix_entry) * num;
+
+	info = kzalloc(sizeof(struct msi_info) + entries_size, GFP_KERNEL);
+	if (!info) {
+		msi_debug("kzalloc failed for %s\n", pci_name(pdev));
+		return -ENOMEM;
+	}
+
+	info->type = type;
+	info->num = num;
+	info->entries = (struct msix_entry *)(info + 1);
+
+	BUG_ON(pdev->msi_info); /* don't leak info structs */
+	pdev->msi_info = info;
+
+	return 0;
+}
+
+static void free_msi_info(struct pci_dev *pdev)
+{
+	kfree(pdev->msi_info);
+	pdev->msi_info = NULL;
+}
+
+
+/* Generic helpers */
+
+static int generic_msi_enable(struct pci_dev *pdev, int nvec,
+				struct msix_entry *entries, int type)
+{
+	struct msi_ops *ops;
+	int i, rc;
+
+	if (no_msi || !pdev || !entries || !nvec || pdev->msi_info) {
+		msi_debug("precondition failed for %p\n", pdev);
+		return -EINVAL;
+	}
+
+	ops = arch_get_msi_ops(pdev);
+	if (!ops) {
+		msi_debug("no ops for %s\n", pci_name(pdev));
+		return -EINVAL;
+	}
+
+	for (i = 0; i < nvec; i++)
+		entries[i].vector = NO_IRQ;
+
+	rc = ops->check(pdev, nvec, entries, type);
+	if (rc) {
+		msi_debug("check failed (%d) for %s\n", rc, pci_name(pdev));
+		return rc;
+	}
+
+	rc = alloc_msi_info(pdev, nvec, entries, type);
+	if (rc)
+		return rc;
+
+	rc = ops->alloc(pdev, nvec, entries, type);
+	if (rc) {
+		msi_debug("alloc failed (%d) for %s\n", rc, pci_name(pdev));
+		goto out_free_info;
+	}
+
+	if (ops->enable) {
+		rc = ops->enable(pdev, nvec, entries, type);
+		if (rc) {
+			msi_debug("enable failed (%d) for %s\n", rc,
+				pci_name(pdev));
+			goto out_ops_free;
+		}
+	}
+
+	/* Copy the updated entries into the msi_info */
+	memcpy(pdev->msi_info->entries, entries,
+			sizeof(struct msix_entry) * nvec);
+	pci_intx(pdev, 0);
+
+	return 0;
+
+ out_ops_free:
+	ops->free(pdev, nvec, entries, type);
+ out_free_info:
+	free_msi_info(pdev);
+
+	return rc;
+}
+
+static int generic_msi_disable(struct pci_dev *pdev, int type)
+{
+	struct msi_ops *ops;
+	struct msi_info *info;
+
+	if (no_msi || !pdev) {
+		msi_debug("precondition failed for %p\n", pdev);
+		return -1;
+	}
+
+	info = pdev->msi_info;
+	if (!info) {
+		msi_debug("No info for %s\n", pci_name(pdev));
+		return -1;
+	}
+
+	ops = arch_get_msi_ops(pdev);
+	if (!ops) {
+		msi_debug("no ops for %s\n", pci_name(pdev));
+		return -1;
+	}
+
+	if (ops->disable)
+		ops->disable(pdev, info->num, info->entries, type);
+
+	ops->free(pdev, info->num, info->entries, type);
+
+	pci_intx(pdev, 1);
+
+	return 0;
+}
+
+
+/* MSI */
+
+int pci_enable_msi(struct pci_dev *pdev)
+{
+	struct msix_entry entry;
+	int rc;
+
+	entry.entry = 0;
+
+	rc = generic_msi_enable(pdev, 1, &entry, PCI_CAP_ID_MSI);
+	if (rc)
+		return rc;
+
+	pdev->msi_info->saved_irq = pdev->irq;
+	pdev->irq = entry.vector;
+	pdev->msi_enabled = 1;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_enable_msi);
+
+void pci_disable_msi(struct pci_dev *pdev)
+{
+	if (generic_msi_disable(pdev, PCI_CAP_ID_MSI) != 0)
+		return;
+
+	pdev->irq = pdev->msi_info->saved_irq;
+	free_msi_info(pdev);
+	pdev->msi_enabled = 0;
+}
+EXPORT_SYMBOL_GPL(pci_disable_msi);
+
+
+/* MSI-X */
+
+int pci_enable_msix(struct pci_dev *pdev, struct msix_entry *entries, int nvec)
+{
+	int rc;
+
+	rc = generic_msi_enable(pdev, nvec, entries, PCI_CAP_ID_MSIX);
+	if (rc)
+		return rc;
+
+	pdev->msix_enabled = 1;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(pci_enable_msix);
+
+void pci_disable_msix(struct pci_dev *pdev)
+{
+	if (generic_msi_disable(pdev, PCI_CAP_ID_MSIX) != 0)
+		return;
+
+	free_msi_info(pdev);
+	pdev->msix_enabled = 0;
+}
+EXPORT_SYMBOL_GPL(pci_disable_msix);
+
+
+/* Stubs for now */
+
+void disable_msi_mode(struct pci_dev *dev, int pos, int type)
+{
+	return;
+}
+
+void msi_remove_pci_irq_vectors(struct pci_dev* dev)
+{
+	return;
+}
Index: msi/include/linux/msi-ops.h
===================================================================
--- /dev/null
+++ msi/include/linux/msi-ops.h
@@ -0,0 +1,168 @@
+/*
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#ifndef LINUX_MSI_OPS_H
+#define LINUX_MSI_OPS_H
+
+#ifdef __KERNEL__
+#ifndef __ASSEMBLY__
+
+#include <linux/pci.h>
+#include <linux/msi.h>
+
+/*
+ * MSI and MSI-X although different in some details, are also similar in
+ * many respects, and ultimately achieve the same end. Given that, this code
+ * tries as far as possible to implement both MSI and MSI-X with a minimum
+ * of code duplication. We will use "MSI" to refer to both MSI and MSI-X,
+ * except where it is important to differentiate between the two.
+ *
+ * Enabling MSI for a device can be broken down into:
+ *  1) Checking the device can support the type/number of MSIs requested.
+ *  2) Allocating irqs for the MSIs and setting up the irq_descs.
+ *  3) Writing the appropriate configuration to the device and enabling MSIs.
+ *
+ * To implement that we have the following callbacks:
+ *  1) check(pdev, num, msix_entries, type)
+ *  2) alloc(pdev, num, msix_entries, type)
+ *  3) enable(pdev, num, msix_entries, type)
+ *	a) setup_msi_msg(pdev, msix_entry, msi_msg, type)
+ *
+ * We give platforms full control over the enable step. However many
+ * platforms will simply want to program the device using standard PCI
+ * accessors. These platforms can use a generic enable callback and define
+ * a setup_msi_msg() callback which simply fills in the "magic" address and
+ * data values. Other platforms may leave setup_msi_msg() empty.
+ *
+ * Disabling MSI requires:
+ *  1) Disabling MSI on the device.
+ *  2) Freeing the irqs and any associated accounting information.
+ *
+ * Which maps directly to the two callbacks:
+ *  1) disable(pdev, num, msix_entries, type)
+ *  2) free(pdev, num, msix_entries, type)
+ */
+
+struct msi_ops
+{
+	/* check - Check that the requested MSI allocation is OK.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for checking that the given PCI device
+	 * can be allocated the requested type and number of MSIs.
+	 *
+	 * It is up to this routine to determine if the requested number of
+	 * MSIs is valid for the device in question. If the number of MSIs,
+	 * or the particular MSI entries, can not be supported for any
+	 * reason this routine must return non-zero.
+	 *
+	 * If the check is successful this routine must return 0.
+	 */
+	int (*check) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* alloc - Allocate MSIs for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for allocating the number of
+	 * MSIs to the given PCI device.
+	 *
+	 * Upon completion there must be @num MSIs assigned to this device,
+	 * the "vector" member of each struct msix_entry must be filled in
+	 * with the Linux irq number allocated to it. The corresponding
+	 * irq_descs must also be setup with an appropriate handler if
+	 * required.
+	 *
+	 * If the allocation completes successfully this routine must return 0.
+	 */
+	int (*alloc) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* enable - Enable the MSIs on the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine enables the MSIs on the given PCI device.
+	 *
+	 * If the enable completes successfully this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*enable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* setup_msi_msg - Setup an MSI message for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @entry:	The MSI entry to create a msi_msg for.
+	 * @msg:	Written with the magic address and data.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Returns the "magic address and data" used to trigger the msi.
+	 * If the setup is successful this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
+				struct msi_msg *msg, int type);
+
+	/* disable - disable the MSI for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs to disable.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine should perform the inverse of enable.
+	 */
+	void (*disable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* free - free the MSIs assigned to the device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Free all MSIs and associated resources for the device. If any
+	 * MSIs have been enabled they will have been disabled already by
+	 * the generic code.
+	 */
+	void (*free) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+};
+
+
+/* Used by the MSI code to track MSI info for a pci_dev */
+struct msi_info {
+	int type;
+	unsigned int saved_irq;
+	unsigned int num;
+	struct msix_entry *entries;
+	void __iomem *msix_base;
+};
+
+#define msi_debug(fmt, args...)	\
+	pr_debug("MSI:%s:%d: " fmt, __FUNCTION__, __LINE__, ## args)
+
+#endif /* __ASSEMBLY__ */
+#endif /* __KERNEL__ */
+#endif /* LINUX_MSI_OPS_H */
Index: msi/include/linux/pci.h
===================================================================
--- msi.orig/include/linux/pci.h
+++ msi/include/linux/pci.h
@@ -107,6 +107,8 @@ struct pci_cap_saved_state {
 	u32 data[0];
 };
 
+struct msi_info;
+
 /*
  * The pci_dev structure is used to describe PCI devices.
  */
@@ -174,6 +176,9 @@ struct pci_dev {
 	struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
 	int rom_attr_enabled;		/* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+#if defined(CONFIG_PCI_MSI_NEW)
+	struct	msi_info *msi_info;
+#endif
 };
 
 #define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)


* [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (4 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 4/16] Abstract MSI suspend Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-26  5:35   ` Eric W. Biederman
  2007-01-25  8:34 ` [RFC/PATCH 7/16] Rip out the existing powerpc msi stubs Michael Ellerman
                   ` (13 subsequent siblings)
  19 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Add bare metal MSI enable & disable routines.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi/Makefile |    2 -
 drivers/pci/msi/raw.c    |   94 +++++++++++++++++++++++++++++++++++++++++++++++
 include/linux/msi-ops.h  |    5 ++
 3 files changed, 100 insertions(+), 1 deletion(-)

Index: msi/drivers/pci/msi/Makefile
===================================================================
--- msi.orig/drivers/pci/msi/Makefile
+++ msi/drivers/pci/msi/Makefile
@@ -2,7 +2,7 @@
 # Makefile for the PCI MSI support
 #
 
-obj-y			+= core.o
+obj-y			+= core.o raw.o
 
 ifeq ($(CONFIG_PCI_MSI_DEBUG),y)
 EXTRA_CFLAGS += -DDEBUG
Index: msi/drivers/pci/msi/raw.c
===================================================================
--- /dev/null
+++ msi/drivers/pci/msi/raw.c
@@ -0,0 +1,94 @@
+/*
+ * Bare metal MSI enable & disable.
+ *
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version
+ * 2 of the License, or (at your option) any later version.
+ */
+
+#include <linux/kernel.h>
+#include <linux/msi-ops.h>
+#include <linux/pci.h>
+#include <asm/msi.h>
+
+int msi_raw_enable(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	struct msi_ops *ops;
+	struct msi_msg msg;
+	int pos;
+	u16 control;
+
+	pos = pci_find_capability(pdev, type);
+	if (!pos) {
+		msi_debug("cap (%d) not found for %s\n", type, pci_name(pdev));
+		return -1;
+	}
+
+	ops = arch_get_msi_ops(pdev);
+	BUG_ON(!ops);
+
+	pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &control);
+
+	switch (type) {
+	case PCI_CAP_ID_MSI:
+		BUG_ON(!ops->setup_msi_msg);
+
+		ops->setup_msi_msg(pdev, &entries[0], &msg, type);
+
+		pci_write_config_dword(pdev, pos + PCI_MSI_ADDRESS_LO,
+			msg.address_lo);
+
+		if (control & PCI_MSI_FLAGS_64BIT) {
+			pci_write_config_dword(pdev, pos + PCI_MSI_ADDRESS_HI,
+						msg.address_hi);
+			pci_write_config_word(pdev, pos + PCI_MSI_DATA_64,
+						msg.data);
+		} else {
+			pci_write_config_word(pdev, pos + PCI_MSI_DATA_32,
+						msg.data);
+		}
+
+		control |= PCI_MSI_FLAGS_ENABLE;
+		break;
+	case PCI_CAP_ID_MSIX:
+		WARN_ON(1); /* XXX implement me */
+		return -1;
+	default:
+		BUG();
+	}
+
+	pci_write_config_word(pdev, pos + PCI_MSI_FLAGS, control);
+
+	return 0;
+}
+
+void msi_raw_disable(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	int pos;
+	u16 control;
+
+	pos = pci_find_capability(pdev, type);
+	BUG_ON(!pos);
+
+	pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &control);
+
+	switch (type) {
+	case PCI_CAP_ID_MSI:
+		control &= ~PCI_MSI_FLAGS_ENABLE;
+		break;
+	case PCI_CAP_ID_MSIX:
+		control &= ~PCI_MSIX_FLAGS_ENABLE;
+		break;
+	default:
+		BUG();
+	}
+
+	pci_write_config_word(pdev, pos + PCI_MSI_FLAGS, control);
+
+	return;
+}
Index: msi/include/linux/msi-ops.h
===================================================================
--- msi.orig/include/linux/msi-ops.h
+++ msi/include/linux/msi-ops.h
@@ -163,6 +163,11 @@ struct msi_info {
 #define msi_debug(fmt, args...)	\
 	pr_debug("MSI:%s:%d: " fmt, __FUNCTION__, __LINE__, ## args)
 
+extern int msi_raw_enable(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type);
+extern void msi_raw_disable(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type);
+
 #endif /* __KERNEL__ */
 #endif /* __ASSEMBLY__ */
 #endif /* LINUX_MSI_OPS_H */
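
For readers following the register writes in msi_raw_enable(), here is a
minimal userspace sketch of the same sequence against a fake capability
buffer. The offsets match the PCI_MSI_* values in <linux/pci_regs.h>;
everything else (fake_msg, fake_msi_program, the put/get helpers) is
hypothetical and only there for illustration. The Message Data register is
16 bits wide, so the sketch writes a 16-bit value, and its position depends
on whether the function advertises 64-bit addressing.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets within the MSI capability, same values as PCI_MSI_* in pci_regs.h */
#define MSI_FLAGS		2	/* 16-bit message control */
#define MSI_FLAGS_ENABLE	0x0001
#define MSI_FLAGS_64BIT		0x0080
#define MSI_ADDRESS_LO		4
#define MSI_ADDRESS_HI		8	/* only present if 64BIT is set */
#define MSI_DATA_32		8	/* data register, 32-bit layout */
#define MSI_DATA_64		12	/* data register, 64-bit layout */

/* Hypothetical stand-in for struct msi_msg */
struct fake_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint16_t data;
};

/* Little-endian accessors on a byte buffer that plays the config space */
static void put16(uint8_t *cap, int off, uint16_t v)
{
	cap[off] = (uint8_t)(v & 0xff);
	cap[off + 1] = (uint8_t)(v >> 8);
}

static void put32(uint8_t *cap, int off, uint32_t v)
{
	put16(cap, off, (uint16_t)(v & 0xffff));
	put16(cap, off + 2, (uint16_t)(v >> 16));
}

static uint16_t get16(const uint8_t *cap, int off)
{
	return (uint16_t)(cap[off] | (cap[off + 1] << 8));
}

static uint32_t get32(const uint8_t *cap, int off)
{
	return (uint32_t)get16(cap, off) | ((uint32_t)get16(cap, off + 2) << 16);
}

/* Program address and data, then set the enable bit last.
 * Returns the offset the data word was written to. */
int fake_msi_program(uint8_t *cap, const struct fake_msg *msg)
{
	uint16_t control;
	int data_off;

	assert(cap && msg);
	control = get16(cap, MSI_FLAGS);

	put32(cap, MSI_ADDRESS_LO, msg->address_lo);
	if (control & MSI_FLAGS_64BIT) {
		put32(cap, MSI_ADDRESS_HI, msg->address_hi);
		data_off = MSI_DATA_64;
	} else {
		data_off = MSI_DATA_32;
	}
	put16(cap, data_off, msg->data);

	put16(cap, MSI_FLAGS, control | MSI_FLAGS_ENABLE);
	return data_off;
}
```

With a zeroed 16-byte buffer the data lands at offset 8. With the 64BIT flag
preset in the control word it lands at offset 12 and offset 8 holds the
upper address half instead.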

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 7/16] Rip out the existing powerpc msi stubs
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (5 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 9/16] RTAS MSI implementation Michael Ellerman
                   ` (12 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Rip out the existing powerpc msi stubs. These were the start of an
implementation based on ppc_md calls, but were never used in mainline.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/kernel/irq.c     |   28 ----------------------------
 include/asm-powerpc/machdep.h |    5 -----
 2 files changed, 33 deletions(-)

Index: msi/arch/powerpc/kernel/irq.c
===================================================================
--- msi.orig/arch/powerpc/kernel/irq.c
+++ msi/arch/powerpc/kernel/irq.c
@@ -945,34 +945,6 @@ arch_initcall(irq_late_init);
 
 #endif /* CONFIG_PPC_MERGE */
 
-#ifdef CONFIG_PCI_MSI
-int pci_enable_msi(struct pci_dev * pdev)
-{
-	if (ppc_md.enable_msi)
-		return ppc_md.enable_msi(pdev);
-	else
-		return -1;
-}
-EXPORT_SYMBOL(pci_enable_msi);
-
-void pci_disable_msi(struct pci_dev * pdev)
-{
-	if (ppc_md.disable_msi)
-		ppc_md.disable_msi(pdev);
-}
-EXPORT_SYMBOL(pci_disable_msi);
-
-void pci_scan_msi_device(struct pci_dev *dev) {}
-int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec) {return -1;}
-void pci_disable_msix(struct pci_dev *dev) {}
-void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
-void disable_msi_mode(struct pci_dev *dev, int pos, int type) {}
-void pci_no_msi(void) {}
-EXPORT_SYMBOL(pci_enable_msix);
-EXPORT_SYMBOL(pci_disable_msix);
-
-#endif
-
 #ifdef CONFIG_PPC64
 static int __init setup_noirqdistrib(char *str)
 {
Index: msi/include/asm-powerpc/machdep.h
===================================================================
--- msi.orig/include/asm-powerpc/machdep.h
+++ msi/include/asm-powerpc/machdep.h
@@ -243,11 +243,6 @@ struct machdep_calls {
 	 */
 	void (*machine_kexec)(struct kimage *image);
 #endif /* CONFIG_KEXEC */
-
-#ifdef CONFIG_PCI_MSI
-	int (*enable_msi)(struct pci_dev *pdev);
-	void (*disable_msi)(struct pci_dev *pdev);
-#endif /* CONFIG_PCI_MSI */
 };
 
 extern void power4_idle(void);

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 8/16] Enable MSI on Powerpc
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (7 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 9/16] RTAS MSI implementation Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 10/16] Add a pci_irq_fixup for MSI via RTAS Michael Ellerman
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Allow PCI_MSI to build on Powerpc. Until we merge and enable some
backends, pci_enable_msi() etc. will always return an error.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/Kconfig           |    3 ++-
 include/asm-powerpc/machdep.h |    4 ++++
 include/asm-powerpc/msi.h     |   23 +++++++++++++++++++++++
 3 files changed, 29 insertions(+), 1 deletion(-)

Index: msi/drivers/pci/Kconfig
===================================================================
--- msi.orig/drivers/pci/Kconfig
+++ msi/drivers/pci/Kconfig
@@ -4,7 +4,8 @@
 config PCI_MSI
 	bool "Message Signaled Interrupts (MSI and MSI-X)"
 	depends on PCI
-	depends on (X86_LOCAL_APIC && X86_IO_APIC) || IA64
+	depends on (X86_LOCAL_APIC && X86_IO_APIC) || IA64 || PPC_MERGE
+	select PCI_MSI_NEW if PPC_MERGE
 	help
 	   This allows device drivers to enable MSI (Message Signaled
 	   Interrupts).  Message Signaled Interrupts enable a device to
Index: msi/include/asm-powerpc/machdep.h
===================================================================
--- msi.orig/include/asm-powerpc/machdep.h
+++ msi/include/asm-powerpc/machdep.h
@@ -30,6 +30,7 @@ struct pci_controller;
 #ifdef CONFIG_KEXEC
 struct kimage;
 #endif
+struct msi_ops;
 
 #ifdef CONFIG_SMP
 struct smp_ops_t {
@@ -111,6 +112,9 @@ struct machdep_calls {
 	void		(*pcibios_fixup)(void);
 	int		(*pci_probe_mode)(struct pci_bus *);
 	void		(*pci_irq_fixup)(struct pci_dev *dev);
+#ifdef CONFIG_PCI_MSI
+	struct msi_ops*	(*get_msi_ops)(struct pci_dev *pdev);
+#endif
 
 	/* To setup PHBs when using automatic OF platform driver for PCI */
 	int		(*pci_setup_phb)(struct pci_controller *host);
Index: msi/include/asm-powerpc/msi.h
===================================================================
--- /dev/null
+++ msi/include/asm-powerpc/msi.h
@@ -0,0 +1,23 @@
+#ifndef __ASM_POWERPC_MSI_H
+#define __ASM_POWERPC_MSI_H
+/*
+ * Copyright (C) 2006-2007 Michael Ellerman, IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <asm/machdep.h>
+
+static inline struct msi_ops *arch_get_msi_ops(struct pci_dev *pdev)
+{
+	if (ppc_md.get_msi_ops)
+		return ppc_md.get_msi_ops(pdev);
+
+	return NULL;
+}
+
+#endif /* __ASM_POWERPC_MSI_H */

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 9/16] RTAS MSI implementation
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (6 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 7/16] Rip out the existing powerpc msi stubs Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 8/16] Enable MSI on Powerpc Michael Ellerman
                   ` (11 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Powerpc MSI support via RTAS.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi/Makefile  |    1 
 drivers/pci/msi/rtas.c    |  268 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/msi.h |    6 +
 3 files changed, 275 insertions(+)

Index: msi/drivers/pci/msi/Makefile
===================================================================
--- msi.orig/drivers/pci/msi/Makefile
+++ msi/drivers/pci/msi/Makefile
@@ -3,6 +3,7 @@
 #
 
 obj-y			+= core.o raw.o
+obj-$(CONFIG_PPC_RTAS)	+= rtas.o
 
 ifeq ($(CONFIG_PCI_MSI_DEBUG),y)
 EXTRA_CFLAGS += -DDEBUG
Index: msi/drivers/pci/msi/rtas.c
===================================================================
--- /dev/null
+++ msi/drivers/pci/msi/rtas.c
@@ -0,0 +1,268 @@
+/*
+ * Copyright (C) 2006 Jake Moilanen <moilanen@austin.ibm.com>, IBM Corp.
+ * Copyright (C) 2006 Michael Ellerman, IBM Corp.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/irq.h>
+#include <linux/msi-ops.h>
+#include <asm/msi.h>
+#include <asm/rtas.h>
+#include <asm/hw_irq.h>
+#include <asm/ppc-pci.h>
+
+static int query_token, change_token;
+
+#define RTAS_QUERY_FN		0
+#define RTAS_CHANGE_FN		1
+#define RTAS_RESET_FN		2
+#define RTAS_CHANGE_MSI_FN	3
+#define RTAS_CHANGE_MSIX_FN	4
+
+static struct pci_dn *get_pdn(struct pci_dev *pdev)
+{
+	struct device_node *dn;
+	struct pci_dn *pdn;
+
+	dn = pci_device_to_OF_node(pdev);
+	if (!dn) {
+		msi_debug("No OF device node for %s\n", pci_name(pdev));
+		return NULL;
+	}
+
+	pdn = PCI_DN(dn);
+	if (!pdn) {
+		msi_debug("No PCI DN for %s\n", pci_name(pdev));
+		return NULL;
+	}
+
+	return pdn;
+}
+
+/* RTAS Helpers */
+
+static int rtas_change_msi(struct pci_dn *pdn, u32 func, u32 num_irqs)
+{
+	u32 addr, seq_num, rtas_ret[3];
+	unsigned long buid;
+	int rc;
+
+	addr = rtas_config_addr(pdn->busno, pdn->devfn, 0);
+	buid = pdn->phb->buid;
+
+	seq_num = 1;
+	do {
+		if (func == RTAS_CHANGE_MSI_FN || func == RTAS_CHANGE_MSIX_FN)
+			rc = rtas_call(change_token, 6, 4, rtas_ret, addr,
+					BUID_HI(buid), BUID_LO(buid),
+					func, num_irqs, seq_num);
+		else
+			rc = rtas_call(change_token, 6, 3, rtas_ret, addr,
+					BUID_HI(buid), BUID_LO(buid),
+					func, num_irqs, seq_num);
+
+		seq_num = rtas_ret[1];
+	} while (rtas_busy_delay(rc));
+
+	if (rc) {
+		msi_debug("error (%d) for %s\n", rc, pci_name(pdn->pcidev));
+		return rc;
+	}
+
+	return rtas_ret[0];
+}
+
+static void rtas_disable_msi(struct pci_dev *pdev)
+{
+	struct pci_dn *pdn;
+
+	pdn = get_pdn(pdev);
+	if (!pdn)
+		return;
+
+	if (rtas_change_msi(pdn, RTAS_CHANGE_FN, 0) != 0) {
+		msi_debug("Setting MSIs to 0 failed!\n");
+		BUG();
+	}
+}
+
+static int rtas_query_irq_number(struct pci_dn *pdn, int offset)
+{
+	u32 addr, rtas_ret[2];
+	unsigned long buid;
+	int rc;
+
+	addr = rtas_config_addr(pdn->busno, pdn->devfn, 0);
+	buid = pdn->phb->buid;
+
+	do {
+		rc = rtas_call(query_token, 4, 3, rtas_ret, addr,
+			       BUID_HI(buid), BUID_LO(buid), offset);
+	} while (rtas_busy_delay(rc));
+
+	if (rc) {
+		msi_debug("error (%d) querying source number for %s\n",
+				rc, pci_name(pdn->pcidev));
+		return rc;
+	}
+
+	return rtas_ret[0];
+}
+
+static void msi_rtas_free(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	int i;
+
+	for (i = 0; i < num; i++) {
+		irq_dispose_mapping(entries[i].vector);
+	}
+
+	rtas_disable_msi(pdev);
+}
+
+static int check_req_msi(struct pci_dev *pdev)
+{
+	struct device_node *dn;
+	struct pci_dn *pdn;
+	const u32 *req_msi;
+
+	pdn = get_pdn(pdev);
+	if (!pdn)
+		return -1;
+
+	dn = pdn->node;
+
+	req_msi = get_property(dn, "ibm,req#msi", NULL);
+	if (!req_msi) {
+		msi_debug("No ibm,req#msi for %s\n", pci_name(pdev));
+		return -1;
+	}
+
+	if (*req_msi == 0) {
+		msi_debug("ibm,req#msi requests 0 MSIs for %s\n",
+			  pci_name(pdev));
+		return -1;
+	}
+
+	return 0;
+}
+
+static int msi_rtas_check(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	int i, rc;
+
+	rc = check_req_msi(pdev);
+	if (rc)
+		return rc;
+
+	/*
+	 * Firmware gives us no control over which entries are allocated
+	 * for MSI-X, it seems to assume we want 0 - n. For now just insist
+	 * that the entries array entry members are 0 - n.
+	 */
+	for (i = 0; i < num; i++) {
+		if (entries[i].entry != i) {
+			msi_debug("entries[%d].entry (%d) != %d\n", i,
+					entries[i].entry, i);
+			return -1;
+		}
+	}
+
+	return 0;
+}
+
+static int msi_rtas_alloc(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	struct pci_dn *pdn;
+	int hwirq, virq, i, rc;
+
+	pdn = get_pdn(pdev);
+	if (!pdn)
+		return -1;
+
+	/*
+	 * Try the new more explicit firmware interface, if that fails fall
+	 * back to the old interface. The old interface is known to never
+	 * return MSI-Xs.
+	 */
+	if (type == PCI_CAP_ID_MSI) {
+		rc = rtas_change_msi(pdn, RTAS_CHANGE_MSI_FN, num);
+
+		if (rc != num) {
+			msi_debug("trying the old firmware interface.\n");
+			rc = rtas_change_msi(pdn, RTAS_CHANGE_FN, num);
+		}
+	} else
+		rc = rtas_change_msi(pdn, RTAS_CHANGE_MSIX_FN, num);
+
+	if (rc != num) {
+		msi_debug("rtas_change_msi() failed for %s\n", pci_name(pdev));
+
+		/*
+		 * In case of an error it's not clear whether the device is
+		 * left with MSI enabled or not, so we explicitly disable.
+		 */
+		goto out_free;
+	}
+
+	for (i = 0; i < num; i++) {
+		hwirq = rtas_query_irq_number(pdn, i);
+		if (hwirq < 0) {
+			msi_debug("error (%d) getting hwirq for %s\n",
+					hwirq, pci_name(pdev));
+			goto out_free;
+		}
+
+		virq = irq_create_mapping(NULL, hwirq);
+
+		if (virq == NO_IRQ) {
+			msi_debug("Failed mapping hwirq %d\n", hwirq);
+			goto out_free;
+		}
+
+		entries[i].vector = virq;
+	}
+
+	return 0;
+
+ out_free:
+	msi_rtas_free(pdev, num, entries, type);
+	return -1;
+}
+
+static struct msi_ops rtas_msi_ops = {
+	.check = msi_rtas_check,
+	.alloc = msi_rtas_alloc,
+	.free  = msi_rtas_free
+};
+
+static struct msi_ops *rtas_get_msi_ops(struct pci_dev *pdev)
+{
+	return &rtas_msi_ops;
+}
+
+int msi_rtas_init(void)
+{
+	query_token  = rtas_token("ibm,query-interrupt-source-number");
+	change_token = rtas_token("ibm,change-msi");
+
+	if ((query_token == RTAS_UNKNOWN_SERVICE) ||
+			(change_token == RTAS_UNKNOWN_SERVICE)) {
+		msi_debug("Couldn't find RTAS tokens, no MSI support.\n");
+		return -1;
+	}
+
+	msi_debug("Registering RTAS MSI ops.\n");
+
+	ppc_md.get_msi_ops = rtas_get_msi_ops;
+
+	return 0;
+}
Index: msi/include/asm-powerpc/msi.h
===================================================================
--- msi.orig/include/asm-powerpc/msi.h
+++ msi/include/asm-powerpc/msi.h
@@ -20,4 +20,10 @@ static inline struct msi_ops *arch_get_m
 	return NULL;
 }
 
+#ifdef CONFIG_PCI_MSI
+extern int msi_rtas_init(void);
+#else
+static inline int msi_rtas_init(void) { return -1; }
+#endif
+
 #endif /* __ASM_POWERPC_MSI_H */
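
One thing that may be worth a second look in msi_rtas_alloc(): the out_free
path hands the full num to msi_rtas_free(), so irq_dispose_mapping() also
runs on entries that were never mapped in this call. Whether that matters
depends on what the caller left in entries[].vector. Below is a minimal
userspace sketch of unwinding only what was actually mapped. The names
(alloc_vectors, map_fn, unmap_fn, demo_map, demo_unmap, NO_VIRQ) are
hypothetical stand-ins, not kernel interfaces, and the firmware side
(rtas_disable_msi()) is left out on purpose.

```c
#include <assert.h>
#include <stddef.h>

#define NO_VIRQ 0u	/* stand-in for NO_IRQ */

/* Hypothetical mapper: returns a virq, or NO_VIRQ on failure */
typedef unsigned int (*map_fn)(int hwirq);
typedef void (*unmap_fn)(unsigned int virq);

/* Map num vectors. On failure, dispose only the ones mapped so far. */
int alloc_vectors(unsigned int *vectors, int num, map_fn map, unmap_fn unmap)
{
	int i;

	assert(vectors && map && unmap && num >= 0);

	for (i = 0; i < num; i++) {
		vectors[i] = map(i);
		if (vectors[i] == NO_VIRQ)
			goto unwind;
	}

	return 0;

unwind:
	/* i is the index that failed; entries 0 .. i-1 hold live mappings */
	while (--i >= 0) {
		unmap(vectors[i]);
		vectors[i] = NO_VIRQ;
	}
	return -1;
}

/* Demo helpers: mapping fails for hwirq 2, unmaps are simply counted */
static int demo_unmapped;

static unsigned int demo_map(int hwirq)
{
	return hwirq == 2 ? NO_VIRQ : 100u + (unsigned int)hwirq;
}

static void demo_unmap(unsigned int virq)
{
	(void)virq;
	demo_unmapped++;
}
```

In the patch itself the equivalent would be to remember how far the loop got
and pass that count to the free path, while still calling rtas_disable_msi()
unconditionally on error.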

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 10/16] Add a pci_irq_fixup for MSI via RTAS
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (8 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 8/16] Enable MSI on Powerpc Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 11/16] Activate MSI via RTAS on pseries Michael Ellerman
                   ` (9 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

When RTAS is managing MSIs for us, it may enable MSI by default on devices
that support it. This is contrary to the Linux model, where a device stays
in LSI mode until the driver requests MSIs.

To remedy this we add a pci_irq_fixup call, which disables MSIs if they've
been assigned by firmware and the device also supports LSI.

At the moment there is no pci_irq_fixup on pSeries, so we can just set it
unconditionally. If other platforms use the RTAS MSI backend they'll need
to check that still holds.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi/rtas.c |   21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

Index: msi/drivers/pci/msi/rtas.c
===================================================================
--- msi.orig/drivers/pci/msi/rtas.c
+++ msi/drivers/pci/msi/rtas.c
@@ -238,6 +238,24 @@ static int msi_rtas_alloc(struct pci_dev
 	return -1;
 }
 
+static void msi_rtas_pci_irq_fixup(struct pci_dev *pdev)
+{
+	/* No LSI -> leave MSIs (if any) configured */
+	if (pdev->irq == NO_IRQ) {
+		msi_debug("no LSI on %s, nothing to do.\n", pci_name(pdev));
+		return;
+	}
+
+	/* No MSI -> MSIs can't have been assigned by fw, leave LSI */
+	if (check_req_msi(pdev)) {
+		msi_debug("no req#msi on %s, nothing to do.\n", pci_name(pdev));
+		return;
+	}
+
+	msi_debug("disabling existing MSI on %s\n", pci_name(pdev));
+	rtas_disable_msi(pdev);
+}
+
 static struct msi_ops rtas_msi_ops = {
 	.check = msi_rtas_check,
 	.alloc = msi_rtas_alloc,
@@ -264,5 +282,8 @@ int msi_rtas_init(void)
 
 	ppc_md.get_msi_ops = rtas_get_msi_ops;
 
+	WARN_ON(ppc_md.pci_irq_fixup);
+	ppc_md.pci_irq_fixup = msi_rtas_pci_irq_fixup;
+
 	return 0;
 }

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 11/16] Activate MSI via RTAS on pseries
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (9 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 10/16] Add a pci_irq_fixup for MSI via RTAS Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 12/16] Tell firmware we support MSI Michael Ellerman
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Activate MSI via RTAS on pseries.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/platforms/pseries/setup.c |    2 ++
 1 file changed, 2 insertions(+)

Index: msi/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- msi.orig/arch/powerpc/platforms/pseries/setup.c
+++ msi/arch/powerpc/platforms/pseries/setup.c
@@ -65,6 +65,7 @@
 #include <asm/i8259.h>
 #include <asm/udbg.h>
 #include <asm/smp.h>
+#include <asm/msi.h>
 
 #include "plpar_wrappers.h"
 #include "ras.h"
@@ -284,6 +285,7 @@ static void __init pseries_discover_pic(
 #ifdef CONFIG_SMP
 			smp_init_pseries_xics();
 #endif
+			msi_rtas_init();
 			return;
 		}
 	}

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 12/16] Tell firmware we support MSI
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (10 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 11/16] Activate MSI via RTAS on pseries Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 13/16] MPIC MSI allocator Michael Ellerman
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

Tell firmware we support MSI.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/kernel/prom_init.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

Index: msi/arch/powerpc/kernel/prom_init.c
===================================================================
--- msi.orig/arch/powerpc/kernel/prom_init.c
+++ msi/arch/powerpc/kernel/prom_init.c
@@ -635,6 +635,12 @@ static void __init early_cmdline_parse(v
 /* ibm,dynamic-reconfiguration-memory property supported */
 #define OV5_DRCONF_MEMORY	0x20
 #define OV5_LARGE_PAGES		0x10	/* large pages supported */
+/* PCIe/MSI support.  Without MSI full PCIe is not supported */
+#ifdef CONFIG_PCI_MSI
+#define OV5_MSI			0x01	/* PCIe/MSI support */
+#else
+#define OV5_MSI			0x00
+#endif /* CONFIG_PCI_MSI */
 
 /*
  * The architecture vector has an array of PVR mask/value pairs,
@@ -679,7 +685,7 @@ static unsigned char ibm_architecture_ve
 	/* option vector 5: PAPR/OF options */
 	3 - 2,				/* length */
 	0,				/* don't ignore, don't halt */
-	OV5_LPAR | OV5_SPLPAR | OV5_LARGE_PAGES | OV5_DRCONF_MEMORY,
+	OV5_LPAR | OV5_SPLPAR | OV5_LARGE_PAGES | OV5_DRCONF_MEMORY | OV5_MSI,
 };
 
 /* Old method - ELF header with PT_NOTE sections */

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 13/16] MPIC MSI allocator
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (11 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 12/16] Tell firmware we support MSI Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 14/16] MPIC MSI backend Michael Ellerman
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

To support MSI on MPIC we need a way to reserve and allocate hardware irq
numbers. This patch implements an allocator for that.

Updated to only do dodgy-U3-fallback-hacks on U3, all other platforms must
define an "msi-available-ranges" property on their MPIC node for MSI to work.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/sysdev/Makefile   |    5 -
 arch/powerpc/sysdev/mpic.c     |    4 
 arch/powerpc/sysdev/mpic.h     |   27 ++++++
 arch/powerpc/sysdev/mpic_msi.c |  171 +++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/mpic.h     |   11 ++
 5 files changed, 217 insertions(+), 1 deletion(-)

Index: msi/arch/powerpc/sysdev/Makefile
===================================================================
--- msi.orig/arch/powerpc/sysdev/Makefile
+++ msi/arch/powerpc/sysdev/Makefile
@@ -2,7 +2,10 @@ ifeq ($(CONFIG_PPC64),y)
 EXTRA_CFLAGS			+= -mno-minimal-toc
 endif
 
-obj-$(CONFIG_MPIC)		+= mpic.o
+mpic-obj-y			:= mpic.o
+mpic-obj-$(CONFIG_PCI_MSI)	+= mpic_msi.o
+obj-$(CONFIG_MPIC)		+= $(mpic-obj-y)
+
 obj-$(CONFIG_PPC_INDIRECT_PCI)	+= indirect_pci.o
 obj-$(CONFIG_PPC_MPC106)	+= grackle.o
 obj-$(CONFIG_PPC_DCR)		+= dcr.o
Index: msi/arch/powerpc/sysdev/mpic.c
===================================================================
--- msi.orig/arch/powerpc/sysdev/mpic.c
+++ msi/arch/powerpc/sysdev/mpic.c
@@ -36,6 +36,8 @@
 #include <asm/mpic.h>
 #include <asm/smp.h>
 
+#include "mpic.h"
+
 #ifdef DEBUG
 #define DBG(fmt...) printk(fmt)
 #else
@@ -825,6 +827,8 @@ static int mpic_host_map(struct irq_host
 	if (hw >= mpic->irq_count)
 		return -EINVAL;
 
+	mpic_msi_reserve_hwirq(mpic, hw);
+
 	/* Default chip */
 	chip = &mpic->hc_irq;
 
Index: msi/arch/powerpc/sysdev/mpic.h
===================================================================
--- /dev/null
+++ msi/arch/powerpc/sysdev/mpic.h
@@ -0,0 +1,27 @@
+#ifndef _POWERPC_SYSDEV_MPIC_H
+#define _POWERPC_SYSDEV_MPIC_H
+
+/*
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/bitmap.h>
+#include <asm/msi.h>
+
+#ifdef CONFIG_PCI_MSI
+extern void mpic_msi_reserve_hwirq(struct mpic *mpic, irq_hw_number_t hwirq);
+#else
+static inline void mpic_msi_reserve_hwirq(struct mpic *mpic,
+					  irq_hw_number_t hwirq)
+{
+	return;
+}
+#endif
+
+#endif /* _POWERPC_SYSDEV_MPIC_H */
Index: msi/arch/powerpc/sysdev/mpic_msi.c
===================================================================
--- /dev/null
+++ msi/arch/powerpc/sysdev/mpic_msi.c
@@ -0,0 +1,171 @@
+/*
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/irq.h>
+#include <linux/bootmem.h>
+#include <linux/msi-ops.h>
+#include <asm/mpic.h>
+#include <asm/prom.h>
+#include <asm/hw_irq.h>
+#include <asm/ppc-pci.h>
+
+
+static void __mpic_msi_reserve_hwirq(struct mpic *mpic, irq_hw_number_t hwirq)
+{
+	msi_debug("reserving hwirq 0x%lx\n", hwirq);
+	bitmap_allocate_region(mpic->hwirq_bitmap, hwirq, 0);
+}
+
+void mpic_msi_reserve_hwirq(struct mpic *mpic, irq_hw_number_t hwirq)
+{
+	unsigned long flags;
+
+	/* The mpic calls this even when there is no allocator setup */
+	if (!mpic->hwirq_bitmap)
+		return;
+
+	spin_lock_irqsave(&mpic->bitmap_lock, flags);
+	__mpic_msi_reserve_hwirq(mpic, hwirq);
+	spin_unlock_irqrestore(&mpic->bitmap_lock, flags);
+}
+
+irq_hw_number_t mpic_msi_alloc_hwirqs(struct mpic *mpic, int num)
+{
+	unsigned long flags;
+	int offset, order = fls(num);
+
+	spin_lock_irqsave(&mpic->bitmap_lock, flags);
+	/*
+	 * This is fast, but stricter than we need. We might want to add
+	 * a fallback routine which does a linear search with no alignment.
+	 */
+	offset = bitmap_find_free_region(mpic->hwirq_bitmap, mpic->irq_count,
+					 order);
+	spin_unlock_irqrestore(&mpic->bitmap_lock, flags);
+
+	msi_debug("allocated %d (2^%d) at offset %d\n", num, order, offset);
+
+	return offset;
+}
+
+void mpic_msi_free_hwirqs(struct mpic *mpic, int offset, int num)
+{
+	unsigned long flags;
+	int order = fls(num);
+
+	msi_debug("freeing %d (2^%d) at offset %d\n", num, order, offset);
+
+	spin_lock_irqsave(&mpic->bitmap_lock, flags);
+	bitmap_release_region(mpic->hwirq_bitmap, offset, order);
+	spin_unlock_irqrestore(&mpic->bitmap_lock, flags);
+}
+
+#ifdef CONFIG_MPIC_BROKEN_U3
+static int mpic_msi_reserve_u3_hwirqs(struct mpic *mpic)
+{
+	irq_hw_number_t hwirq;
+	struct irq_host_ops *ops = mpic->irqhost->ops;
+	struct device_node *np;
+	int flags, index, i;
+	struct of_irq oirq;
+
+	msi_debug("found U3, guessing msi allocator setup\n");
+
+	/* Reserve source numbers we know are reserved in the HW */
+	for (i = 0;   i < 8;   i++) __mpic_msi_reserve_hwirq(mpic, i);
+	for (i = 42;  i < 46;  i++) __mpic_msi_reserve_hwirq(mpic, i);
+	for (i = 100; i < 105; i++) __mpic_msi_reserve_hwirq(mpic, i);
+
+	np = NULL;
+	while ((np = of_find_all_nodes(np))) {
+		msi_debug("mapping hwirqs for %s\n", np->full_name);
+
+		index = 0;
+		while (of_irq_map_one(np, index++, &oirq) == 0) {
+			ops->xlate(mpic->irqhost, NULL, oirq.specifier,
+						oirq.size, &hwirq, &flags);
+			__mpic_msi_reserve_hwirq(mpic, hwirq);
+		}
+	}
+
+	return 0;
+}
+#else
+static int mpic_msi_reserve_u3_hwirqs(struct mpic *mpic) { return -1; }
+#endif
+
+static int mpic_msi_reserve_dt_hwirqs(struct mpic *mpic)
+{
+	int i, len;
+	const u32 *p;
+
+	p = get_property(mpic->of_node, "msi-available-ranges", &len);
+	if (!p) {
+		msi_debug("no msi-available-ranges property found on %s\n",
+			  mpic->of_node->full_name);
+		return -ENODEV;
+	}
+
+	if (len % 8 != 0) {
+		printk(KERN_WARNING "Malformed msi-available-ranges "
+		       "property on %s\n", mpic->of_node->full_name);
+		return -EINVAL;
+	}
+
+	bitmap_allocate_region(mpic->hwirq_bitmap, 0, fls(mpic->irq_count));
+
+	/* Format is: (<u32 start> <u32 count>)+ */
+	len /= sizeof(u32);
+	for (i = 0; i < len / 2; i++, p += 2)
+		mpic_msi_free_hwirqs(mpic, *p, *(p + 1));
+
+	return 0;
+}
+
+int mpic_msi_init_allocator(struct mpic *mpic)
+{
+	int rc, size;
+
+	BUG_ON(mpic->hwirq_bitmap);
+	spin_lock_init(&mpic->bitmap_lock);
+
+	size = mpic->irq_count / 8;
+	msi_debug("allocator bitmap size is 0x%x bytes\n", size);
+
+	if (mem_init_done)
+		mpic->hwirq_bitmap = kmalloc(size, GFP_KERNEL);
+	else
+		mpic->hwirq_bitmap = alloc_bootmem(size);
+
+	if (!mpic->hwirq_bitmap) {
+		msi_debug("no mem allocating allocator bitmap!\n");
+		return -ENOMEM;
+	}
+
+	memset(mpic->hwirq_bitmap, 0, size);
+
+	rc = mpic_msi_reserve_dt_hwirqs(mpic);
+	if (rc) {
+		if (mpic->flags & MPIC_BROKEN_U3)
+			rc = mpic_msi_reserve_u3_hwirqs(mpic);
+
+		if (rc)
+			goto out_free;
+	}
+
+	return 0;
+
+ out_free:
+	if (mem_init_done)
+		kfree(mpic->hwirq_bitmap);
+
+	mpic->hwirq_bitmap = NULL;
+	return rc;
+}
Index: msi/include/asm-powerpc/mpic.h
===================================================================
--- msi.orig/include/asm-powerpc/mpic.h
+++ msi/include/asm-powerpc/mpic.h
@@ -300,6 +300,11 @@ struct mpic
 	u32			*hw_set;
 #endif
 
+#ifdef CONFIG_PCI_MSI
+	spinlock_t		bitmap_lock;
+	unsigned long		*hwirq_bitmap;
+#endif
+
 	/* link */
 	struct mpic		*next;
 };
@@ -446,5 +451,11 @@ void mpic_set_clk_ratio(struct mpic *mpi
 /* Enable/Disable EPIC serial interrupt mode */
 void mpic_set_serial_int(struct mpic *mpic, int enable);
 
+#ifdef CONFIG_PCI_MSI
+extern int mpic_msi_init_allocator(struct mpic *mpic);
+extern irq_hw_number_t mpic_msi_alloc_hwirqs(struct mpic *mpic, int num);
+extern void mpic_msi_free_hwirqs(struct mpic *mpic, int offset, int num);
+#endif
+
 #endif /* __KERNEL__ */
 #endif	/* _ASM_POWERPC_MPIC_H */
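
Two details of the allocator arithmetic may deserve a check. First, fls(num)
is used as the region order, but fls(1) is 1, so a request for a single
hwirq reserves a region of two, and any power-of-two request is doubled.
Second, the bitmap is sized as irq_count / 8 bytes, which rounds down and is
not a multiple of sizeof(long), while the bitmap routines operate on whole
unsigned longs. The userspace sketch below shows the arithmetic I would
expect instead. sketch_fls(), sketch_count_order() and sketch_bitmap_bytes()
are hypothetical helpers written for this note; the order helper follows the
same idea as the kernel's get_count_order().

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Userspace stand-in for fls(): 1-based index of the highest set bit,
 * 0 when no bit is set. */
static int sketch_fls(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/* Smallest order such that (1 << order) >= num */
int sketch_count_order(unsigned int num)
{
	assert(num > 0);
	return sketch_fls(num - 1);
}

/* Bytes needed for an nbits bitmap, rounded up to whole unsigned longs */
size_t sketch_bitmap_bytes(unsigned int nbits)
{
	size_t bits_per_long = sizeof(unsigned long) * CHAR_BIT;

	return ((nbits + bits_per_long - 1) / bits_per_long)
		* sizeof(unsigned long);
}
```

For num = 1, 2, 3, 4, 5 the order comes out as 0, 1, 2, 2, 3, whereas
fls(num) gives 1, 2, 2, 3, 3. The same over-sizing would apply to the
fls(mpic->irq_count) used when the whole bitmap is marked reserved before
the msi-available-ranges entries are released.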

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (12 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 13/16] MPIC MSI allocator Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-26  6:43   ` Grant Grundler
  2007-01-26  9:11   ` Segher Boessenkool
  2007-01-25  8:34 ` [RFC/PATCH 15/16] Enable MSI mappings for MPIC Michael Ellerman
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

MPIC MSI backend. Based on code from Segher, heavily hacked by me.
Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.

We properly discover the HT magic address by reading the config space.
Now that we have an irq allocator we can support more than one MSI, and we
no longer reuse the LSI.

Tested: successfully getting MSIs from the tg3 via HT/PCI-X on a JS21
running SLOF. Successive insmod/rmmod cycles work too.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 drivers/pci/msi/Makefile     |    1 
 drivers/pci/msi/mpic_htmsi.c |  190 +++++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/msi.h    |    3 
 3 files changed, 194 insertions(+)

Index: msi/drivers/pci/msi/Makefile
===================================================================
--- msi.orig/drivers/pci/msi/Makefile
+++ msi/drivers/pci/msi/Makefile
@@ -4,6 +4,7 @@
 
 obj-y			+= core.o raw.o
 obj-$(CONFIG_PPC_RTAS)	+= rtas.o
+obj-$(CONFIG_MPIC)	+= mpic_htmsi.o
 
 ifeq ($(CONFIG_PCI_MSI_DEBUG),y)
 EXTRA_CFLAGS += -DDEBUG
Index: msi/drivers/pci/msi/mpic_htmsi.c
===================================================================
--- /dev/null
+++ msi/drivers/pci/msi/mpic_htmsi.c
@@ -0,0 +1,190 @@
+/*
+ * Copyright 2006, Segher Boessenkool, IBM Corporation.
+ * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2 of the
+ * License.
+ *
+ */
+
+#include <linux/irq.h>
+#include <linux/bootmem.h>
+#include <linux/msi-ops.h>
+#include <asm/msi.h>
+#include <asm/mpic.h>
+#include <asm/prom.h>
+#include <asm/hw_irq.h>
+#include <asm/ppc-pci.h>
+
+/* XXX Do we ever need > 1 of these? void * msi_ops.data perhaps ? */
+static struct mpic *msi_mpic;
+
+static unsigned int find_ht_msi_capability(struct pci_dev *pdev)
+{
+	unsigned int pos = pci_find_capability(pdev, PCI_CAP_ID_HT);
+	u8 subcap, ttl = 48;
+
+	while (pos && ttl--) {
+		pci_read_config_byte(pdev, pos + 3, &subcap);
+		if ((subcap & 0xF8) == HT_CAPTYPE_MSI_MAPPING)
+			return pos;
+		pos = pci_find_next_capability(pdev, pos, PCI_CAP_ID_HT);
+	}
+
+	return 0;
+}
+
+static u64 read_ht_magic_addr(struct pci_dev *pdev, unsigned int pos)
+{
+	u8 flags;
+	u32 tmp;
+	u64 addr;
+
+	pci_read_config_byte(pdev, pos + HT_MSI_FLAGS, &flags);
+
+	if (flags & HT_MSI_FLAGS_FIXED)
+		return HT_MSI_FIXED_ADDR;
+
+	pci_read_config_dword(pdev, pos + HT_MSI_ADDR_LO, &tmp);
+	addr = tmp & HT_MSI_ADDR_LO_MASK;
+	pci_read_config_dword(pdev, pos + HT_MSI_ADDR_HI, &tmp);
+	addr = addr | ((u64)tmp << 32);
+
+	return addr;
+}
+
+static u64 find_ht_magic_addr(struct pci_dev *pdev)
+{
+	struct pci_bus *bus;
+	unsigned int pos;
+
+	for (bus = pdev->bus; bus && bus->self; bus = bus->parent) {
+		pos = find_ht_msi_capability(bus->self);
+		if (pos)
+			return read_ht_magic_addr(bus->self, pos);
+	}
+
+	return 0;
+}
+
+static int htmsi_check(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	if (type == PCI_CAP_ID_MSIX) {
+		msi_debug("MSI-X unsupported for %s\n", pci_name(pdev));
+		return 1;
+	}
+
+	/* If we can't find a magic address then MSI ain't gonna work */
+	if (find_ht_magic_addr(pdev) == 0) {
+		msi_debug("no magic address found for %s\n", pci_name(pdev));
+		return 1;
+	}
+
+	return 0;
+}
+
+static void htmsi_free(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	irq_hw_number_t hwirq;
+	int i;
+
+	hwirq = irq_map[entries[0].vector].hwirq;
+
+	for (i = 0; i < num; i++) {
+		irq_dispose_mapping(entries[i].vector);
+		entries[i].vector = NO_IRQ;
+	}
+
+	msi_debug("freeing %d hwirqs for msi at offset 0x%lx\n", num, hwirq);
+	mpic_msi_free_hwirqs(msi_mpic, hwirq, num);
+
+	return;
+}
+
+static int htmsi_alloc(struct pci_dev *pdev, int num,
+			struct msix_entry *entries, int type)
+{
+	int i;
+	irq_hw_number_t hwirq;
+	unsigned int virq;
+
+	hwirq = mpic_msi_alloc_hwirqs(msi_mpic, num);
+	if ((long)hwirq < 0) {
+		msi_debug("failed allocating %d hwirqs for %s\n", num,
+			  pci_name(pdev));
+		return -1;
+	}
+
+	for (i = 0; i < num; i++) {
+		virq = irq_create_mapping(msi_mpic->irqhost, hwirq);
+		if (virq == NO_IRQ) {
+			msi_debug("failed mapping hwirq 0x%lx for %s\n", hwirq,
+				  pci_name(pdev));
+			goto out_free;
+		}
+
+		/* FIXME should we save the existing type */
+		set_irq_type(virq, IRQ_TYPE_EDGE_RISING);
+
+		entries[i].vector = virq;
+		hwirq++;
+	}
+
+	return 0;
+
+ out_free:
+	htmsi_free(pdev, num, entries, type);
+	return -1;
+}
+
+static int htmsi_setup_msi_msg(struct pci_dev *pdev,
+		struct msix_entry *entry, struct msi_msg *msg, int type)
+{
+	u64 addr;
+
+	addr = find_ht_magic_addr(pdev);
+	msg->address_lo = addr & 0xFFFFFFFF;
+	msg->address_hi = addr >> 32;
+	msg->data = irq_map[entry->vector].hwirq;
+
+	msi_debug("allocated irq %d at 0x%lx for %s\n", entry->vector,
+			addr, pci_name(pdev));
+
+	return 0;
+}
+
+static struct msi_ops mpic_htmsi_ops = {
+	.check = htmsi_check,
+	.alloc = htmsi_alloc,
+	.free = htmsi_free,
+	.enable = msi_raw_enable,
+	.disable = msi_raw_disable,
+	.setup_msi_msg = htmsi_setup_msi_msg,
+};
+
+static struct msi_ops *htmsi_get_msi_ops(struct pci_dev *pdev)
+{
+	return &mpic_htmsi_ops;
+}
+
+int mpic_htmsi_init(struct mpic *mpic)
+{
+	int rc;
+
+	rc = mpic_msi_init_allocator(mpic);
+	if (rc) {
+		pr_debug("mpic_htmsi_init: Error allocating bitmap!\n");
+		return rc;
+	}
+
+	msi_mpic = mpic;
+
+	pr_debug("mpic_htmsi_init: Registering MPIC MSI ops.\n");
+	ppc_md.get_msi_ops = htmsi_get_msi_ops;
+
+	return 0;
+}
Index: msi/include/asm-powerpc/msi.h
===================================================================
--- msi.orig/include/asm-powerpc/msi.h
+++ msi/include/asm-powerpc/msi.h
@@ -20,10 +20,13 @@ static inline struct msi_ops *arch_get_m
 	return NULL;
 }
 
+struct mpic;
 #ifdef CONFIG_PCI_MSI
+extern int mpic_htmsi_init(struct mpic *mpic);
 extern int msi_rtas_init(void);
 #else
 static inline int msi_rtas_init(void) { return -1; };
+static inline int mpic_htmsi_init(struct mpic *mpic) { return -1; }
 #endif
 
 #endif /* __ASM_POWERPC_MSI_H */
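
The address handling in read_ht_magic_addr() and htmsi_setup_msi_msg() can be
tried out in userspace with the sketch below. It only shows the arithmetic:
pick the fixed address when the FIXED flag is set, otherwise combine the masked
low dword with the high dword, then split the 64-bit result back into the two
halves of the message. The constant values are my understanding of the
HyperTransport definitions and should be treated as assumptions to check
against the headers in your tree; struct msg_sketch and both function names are
hypothetical.

```c
#include <stdint.h>

/* Assumed values; verify against the HT definitions in your kernel tree. */
#define HT_MSI_FLAGS_FIXED	0x2
#define HT_MSI_FIXED_ADDR	0x00000000FEE00000ULL
#define HT_MSI_ADDR_LO_MASK	0xFFF00000U

/* Hypothetical stand-in for the address/data part of struct msi_msg. */
struct msg_sketch {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Derive the 64-bit magic address from the raw capability registers. */
uint64_t ht_magic_addr(uint8_t flags, uint32_t lo, uint32_t hi)
{
	if (flags & HT_MSI_FLAGS_FIXED)
		return HT_MSI_FIXED_ADDR;

	/* Low bits of the low dword are not address bits, so mask them off. */
	return ((uint64_t)hi << 32) | (lo & HT_MSI_ADDR_LO_MASK);
}

/* Split the address into halves; the data word carries the hwirq number. */
void fill_msg(struct msg_sketch *msg, uint64_t addr, uint32_t hwirq)
{
	msg->address_lo = (uint32_t)(addr & 0xFFFFFFFFU);
	msg->address_hi = (uint32_t)(addr >> 32);
	msg->data = hwirq;
}
```

The cast to a 64-bit type before the shift matters: shifting a 32-bit value
left by 32 is undefined in C, which is why the patch also casts tmp to u64.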

* [RFC/PATCH 15/16] Enable MSI mappings for MPIC
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (13 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 14/16] MPIC MSI backend Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25  8:34 ` [RFC/PATCH 16/16] Activate MSI for the MPIC backend on U3 Michael Ellerman
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

On some Apple machines the HT MSI mappings are not enabled by firmware, so
we need to do it by hand.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/sysdev/mpic.c |   49 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 47 insertions(+), 2 deletions(-)

Index: msi/arch/powerpc/sysdev/mpic.c
===================================================================
--- msi.orig/arch/powerpc/sysdev/mpic.c
+++ msi/arch/powerpc/sysdev/mpic.c
@@ -379,7 +379,51 @@ static void mpic_shutdown_ht_interrupt(s
 	spin_unlock_irqrestore(&mpic->fixup_lock, flags);
 }
 
-static void __init mpic_scan_ht_pic(struct mpic *mpic, u8 __iomem *devbase,
+#ifdef CONFIG_PCI_MSI
+static void __init mpic_setup_ht_msi(struct mpic *mpic, u8 __iomem *devbase,
+				    unsigned int devfn)
+{
+	u8 __iomem *base;
+	u8 pos, flags;
+	u64 addr = 0;
+
+	for (pos = readb(devbase + PCI_CAPABILITY_LIST); pos != 0;
+	     pos = readb(devbase + pos + PCI_CAP_LIST_NEXT)) {
+		u8 id = readb(devbase + pos + PCI_CAP_LIST_ID);
+		if (id == PCI_CAP_ID_HT) {
+			id = readb(devbase + pos + 3);
+			if ((id & HT_5BIT_CAP_MASK) == HT_CAPTYPE_MSI_MAPPING)
+				break;
+		}
+	}
+
+	if (pos == 0)
+		return;
+
+	base = devbase + pos;
+
+	flags = readb(base + HT_MSI_FLAGS);
+	if (!(flags & HT_MSI_FLAGS_FIXED)) {
+		addr = readl(base + HT_MSI_ADDR_LO) & HT_MSI_ADDR_LO_MASK;
+		addr = addr | ((u64)readl(base + HT_MSI_ADDR_HI) << 32);
+	}
+
+	printk(KERN_DEBUG "mpic:   - HT:%02x.%x %s MSI mapping found @ 0x%lx\n",
+		PCI_SLOT(devfn), PCI_FUNC(devfn),
+		flags & HT_MSI_FLAGS_ENABLE ? "enabled" : "disabled", addr);
+
+	if (!(flags & HT_MSI_FLAGS_ENABLE))
+		writeb(flags | HT_MSI_FLAGS_ENABLE, base + HT_MSI_FLAGS);
+}
+#else
+static void __init mpic_setup_ht_msi(struct mpic *mpic, u8 __iomem *devbase,
+				    unsigned int devfn)
+{
+	return;
+}
+#endif
+
+static void __init mpic_setup_ht_pic(struct mpic *mpic, u8 __iomem *devbase,
 				    unsigned int devfn, u32 vdid)
 {
 	int i, irq, n;
@@ -469,7 +513,8 @@ static void __init mpic_scan_ht_pics(str
 		if (!(s & PCI_STATUS_CAP_LIST))
 			goto next;
 
-		mpic_scan_ht_pic(mpic, devbase, devfn, l);
+		mpic_setup_ht_pic(mpic, devbase, devfn, l);
+		mpic_setup_ht_msi(mpic, devbase, devfn);
 
 	next:
 		/* next device, if function 0 */
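
The capability walk in mpic_setup_ht_msi() can be illustrated on a plain byte
array standing in for a device's 256-byte config space. This is a sketch under
assumptions, not the kernel code: it reads from memory instead of using
readb(), and it adds a hop limit and a bounds check that the patch does not
have, so that a malformed or looping list cannot hang the walk. The function
name find_ht_msi_mapping is hypothetical; the register offsets and IDs are the
standard PCI and HyperTransport values as I understand them.

```c
#include <stdint.h>

#define PCI_CAPABILITY_LIST	0x34	/* offset of first capability pointer */
#define PCI_CAP_LIST_ID		0	/* capability ID, relative to pos */
#define PCI_CAP_LIST_NEXT	1	/* next pointer, relative to pos */
#define PCI_CAP_ID_HT		0x08	/* HyperTransport capability */
#define HT_5BIT_CAP_MASK	0xF8	/* HT capability type field */
#define HT_CAPTYPE_MSI_MAPPING	0xA8	/* MSI mapping capability type */

/*
 * Return the config space offset of the HT MSI mapping capability,
 * or 0 if the device has none.
 */
unsigned int find_ht_msi_mapping(const uint8_t cfg[256])
{
	uint8_t pos = cfg[PCI_CAPABILITY_LIST];
	int ttl = 48;	/* hop limit, guards against a looping list */

	/* Capabilities live above the standard header; pos + 3 must fit. */
	while (pos >= 0x40 && pos <= 0xFC && ttl--) {
		if (cfg[pos + PCI_CAP_LIST_ID] == PCI_CAP_ID_HT &&
		    (cfg[pos + 3] & HT_5BIT_CAP_MASK) == HT_CAPTYPE_MSI_MAPPING)
			return pos;
		pos = cfg[pos + PCI_CAP_LIST_NEXT];
	}

	return 0;
}
```

An HT capability is identified in two steps: the generic capability ID says
"HyperTransport", and the type field in the byte at offset 3 says which kind of
HT capability it is. Only the second check distinguishes an MSI mapping from,
say, an HT link capability.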

* [RFC/PATCH 16/16] Activate MSI for the MPIC backend on U3
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (14 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 15/16] Enable MSI mappings for MPIC Michael Ellerman
@ 2007-01-25  8:34 ` Michael Ellerman
  2007-01-25 21:53 ` [RFC/PATCH 0/16] Ops based MSI Implementation Greg KH
                   ` (3 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-25  8:34 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S. Miller, Eric W. Biederman

If we have a U3 and it's the primary, enable MSIs via HT.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
---

 arch/powerpc/sysdev/mpic.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

Index: msi/arch/powerpc/sysdev/mpic.c
===================================================================
--- msi.orig/arch/powerpc/sysdev/mpic.c
+++ msi/arch/powerpc/sysdev/mpic.c
@@ -1171,8 +1171,10 @@ void __init mpic_init(struct mpic *mpic)
 
 	/* Do the HT PIC fixups on U3 broken mpic */
 	DBG("MPIC flags: %x\n", mpic->flags);
-	if ((mpic->flags & MPIC_BROKEN_U3) && (mpic->flags & MPIC_PRIMARY))
- 		mpic_scan_ht_pics(mpic);
+	if ((mpic->flags & MPIC_BROKEN_U3) && (mpic->flags & MPIC_PRIMARY)) {
+		mpic_scan_ht_pics(mpic);
+		mpic_htmsi_init(mpic);
+	}
 
 	for (i = 0; i < mpic->num_sources; i++) {
 		/* start with vector = source number, and masked */

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 5/16] Ops based MSI implementation
  2007-01-25  8:34 ` [RFC/PATCH 5/16] Ops based MSI implementation Michael Ellerman
@ 2007-01-25 21:52   ` Greg KH
  2007-01-25 22:05     ` Roland Dreier
  2007-01-26  1:02     ` Michael Ellerman
  0 siblings, 2 replies; 178+ messages in thread
From: Greg KH @ 2007-01-25 21:52 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

On Thu, Jan 25, 2007 at 07:34:09PM +1100, Michael Ellerman wrote:
> --- /dev/null
> +++ msi/include/linux/msi-ops.h
> @@ -0,0 +1,168 @@
> +/*
> + * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.

Are you sure of the "any later version" part?

> + */
> +
> +#ifndef LINUX_MSI_OPS_H
> +#define LINUX_MSI_OPS_H
> +
> +#ifdef __KERNEL__
> +#ifndef __ASSEMBLY__

These two ifdefs aren't needed.

> +
> +#include <linux/pci.h>
> +#include <linux/msi.h>

Why not just put this structure in the msi.h file?

> +
> +/*
> + * MSI and MSI-X although different in some details, are also similar in
> + * many respects, and ultimately achieve the same end. Given that, this code
> + * tries as far as possible to implement both MSI and MSI-X with a minimum
> + * of code duplication. We will use "MSI" to refer to both MSI and MSI-X,
> + * except where it is important to differentiate between the two.
> + *
> + * Enabling MSI for a device can be broken down into:
> + *  1) Checking the device can support the type/number of MSIs requested.
> + *  2) Allocating irqs for the MSIs and setting up the irq_descs.
> + *  3) Writing the appropriate configuration to the device and enabling MSIs.
> + *
> + * To implement that we have the following callbacks:
> + *  1) check(pdev, num, msix_entries, type)
> + *  2) alloc(pdev, num, msix_entries, type)
> + *  3) enable(pdev, num, msix_entries, type)
> + *	a) setup_msi_msg(pdev, msix_entry, msi_msg, type)
> + *
> + * We give platforms full control over the enable step. However many
> + * platforms will simply want to program the device using standard PCI
> + * accessors. These platforms can use a generic enable callback and define
> + * a setup_msi_msg() callback which simply fills in the "magic" address and
> + * data values. Other platforms may leave setup_msi_msg() empty.
> + *
> + * Disabling MSI requires:
> + *  1) Disabling MSI on the device.
> + *  2) Freeing the irqs and any associated accounting information.
> + *
> + * Which maps directly to the two callbacks:
> + *  1) disable(pdev, num, msix_entries, type)
> + *  2) free(pdev, num, msix_entries, type)
> + */


Please use the proper kernel-doc format so the tools pick up this
documentation automatically.

> +#define msi_debug(fmt, args...)	\
> +	pr_debug("MSI:%s:%d: " fmt, __FUNCTION__, __LINE__, ## args)

Please use dev_dbg(), it makes it easier to track which device is being
referenced.

thanks,

greg k-h
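
As an illustration of the callback flow described in the quoted comment, here
is a small userspace sketch of how a core might drive the ops: check, alloc and
enable on the way up, disable and free on the way down. Everything here is
hypothetical (the stub types, msi_ops_sketch, the toy backend), and undoing the
allocation when enable fails is my assumption rather than something the quoted
text states.

```c
#include <stddef.h>

/* Hypothetical stand-ins for struct pci_dev and struct msix_entry. */
struct dev_stub { int max_vectors; };
struct entry_stub { unsigned int vector; };

typedef int (*msi_step_fn)(struct dev_stub *, int, struct entry_stub *, int);
typedef void (*msi_undo_fn)(struct dev_stub *, int, struct entry_stub *, int);

/* Callback table mirroring the steps named in the comment. */
struct msi_ops_sketch {
	msi_step_fn check, alloc, enable;
	msi_undo_fn disable, free;
};

/* Enable path: check, then alloc, then enable; undo alloc if enable fails. */
int msi_enable_sketch(const struct msi_ops_sketch *ops, struct dev_stub *d,
		      int num, struct entry_stub *e, int type)
{
	int rc;

	rc = ops->check(d, num, e, type);
	if (rc)
		return rc;
	rc = ops->alloc(d, num, e, type);
	if (rc)
		return rc;
	rc = ops->enable(d, num, e, type);
	if (rc)
		ops->free(d, num, e, type);
	return rc;
}

/* Disable path: quiesce the device first, then release the irqs. */
void msi_disable_sketch(const struct msi_ops_sketch *ops, struct dev_stub *d,
			int num, struct entry_stub *e, int type)
{
	ops->disable(d, num, e, type);
	ops->free(d, num, e, type);
}

/* Toy backend that records the order in which the callbacks run. */
char toy_trace[16];
static size_t toy_len;

static void toy_note(char c)
{
	if (toy_len + 1 < sizeof(toy_trace)) {
		toy_trace[toy_len++] = c;
		toy_trace[toy_len] = '\0';
	}
}

static int toy_check(struct dev_stub *d, int num, struct entry_stub *e, int type)
{
	(void)e; (void)type;
	toy_note('c');
	return num > d->max_vectors;	/* non-zero: request not supported */
}

static int toy_alloc(struct dev_stub *d, int num, struct entry_stub *e, int type)
{
	int i;

	(void)d; (void)type;
	toy_note('a');
	for (i = 0; i < num; i++)
		e[i].vector = 100 + i;	/* pretend irq numbers */
	return 0;
}

static int toy_enable(struct dev_stub *d, int num, struct entry_stub *e, int type)
{
	(void)d; (void)num; (void)e; (void)type;
	toy_note('e');
	return 0;
}

static void toy_disable(struct dev_stub *d, int num, struct entry_stub *e, int type)
{
	(void)d; (void)num; (void)e; (void)type;
	toy_note('d');
}

static void toy_free(struct dev_stub *d, int num, struct entry_stub *e, int type)
{
	(void)d; (void)num; (void)e; (void)type;
	toy_note('f');
}

const struct msi_ops_sketch toy_ops = {
	.check = toy_check, .alloc = toy_alloc, .enable = toy_enable,
	.disable = toy_disable, .free = toy_free,
};
```

Keeping check separate from alloc lets a backend reject a request (for example
MSI-X on a platform that only does MSI) before any irqs have been set up, so
there is nothing to unwind on that path.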

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (15 preceding siblings ...)
  2007-01-25  8:34 ` [RFC/PATCH 16/16] Activate MSI for the MPIC backend on U3 Michael Ellerman
@ 2007-01-25 21:53 ` Greg KH
  2007-01-25 21:55   ` David Miller
  2007-01-26  1:03   ` Michael Ellerman
  2007-01-26  6:18 ` Eric W. Biederman
                   ` (2 subsequent siblings)
  19 siblings, 2 replies; 178+ messages in thread
From: Greg KH @ 2007-01-25 21:53 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

On Thu, Jan 25, 2007 at 07:34:07PM +1100, Michael Ellerman wrote:
> OK, here's a first cut at moving ops based MSI into the generic code. I'm
> posting this now to make sure I'm not heading off into the weeds.
> 
> The fifth patch contain the guts of it, I've included the MPIC and
> RTAS backends as examples. In fact they actually work.
> 
> In order to smoothly merge this with the old MSI code, the two will need to
> coexist in the tree for at least a few commits, so I've added (invisible)
> Kconfig symbols to allow that.
> 
> I plan to merge the Intel code by:
>  * copying it into drivers/pci/msi/intel.c with zero changes.
>  * providing a minimal shim to connect the ops code to the intel code.
>  * at this point the code should be functional but ugly as hell.
>  * via a longish series of patches, adapt the intel code to better match
>    the new ops code.
>  * this should allow us to bisect through to find any mistakes.
> 
> If people think that's crazy and or stupid please let me know :)

At first glance, this looks sane.  I'll apply the first 4 patches to my
trees, and hold off on the rest until you have the intel patches
finished.

thanks,

greg k-h

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25 21:53 ` [RFC/PATCH 0/16] Ops based MSI Implementation Greg KH
@ 2007-01-25 21:55   ` David Miller
  2007-01-26  1:05     ` Michael Ellerman
  2007-01-26  1:03   ` Michael Ellerman
  1 sibling, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-25 21:55 UTC (permalink / raw)
  To: greg; +Cc: kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

From: Greg KH <greg@kroah.com>
Date: Thu, 25 Jan 2007 13:53:07 -0800

> On Thu, Jan 25, 2007 at 07:34:07PM +1100, Michael Ellerman wrote:
> > OK, here's a first cut at moving ops based MSI into the generic code. I'm
> > posting this now to make sure I'm not heading off into the weeds.
> > 
> > The fifth patch contain the guts of it, I've included the MPIC and
> > RTAS backends as examples. In fact they actually work.
> > 
> > In order to smoothly merge this with the old MSI code, the two will need to
> > coexist in the tree for at least a few commits, so I've added (invisible)
> > Kconfig symbols to allow that.
> > 
> > I plan to merge the Intel code by:
> >  * copying it into drivers/pci/msi/intel.c with zero changes.
> >  * providing a minimal shim to connect the ops code to the intel code.
> >  * at this point the code should be functional but ugly as hell.
> >  * via a longish series of patches, adapt the intel code to better match
> >    the new ops code.
> >  * this should allow us to bisect through to find any mistakes.
> > 
> > If people think that's crazy and or stupid please let me know :)
> 
> At first glance, this looks sane.  I'll apply the first 4 patches to my
> trees, and hold off on the rest until you have the intel patches
> finished.

I'll also look into a sparc64 implementation as soon as I find the
time.

* Re: [RFC/PATCH 5/16] Ops based MSI implementation
  2007-01-25 21:52   ` Greg KH
@ 2007-01-25 22:05     ` Roland Dreier
  2007-01-25 22:10       ` Greg KH
  2007-01-26  1:02     ` Michael Ellerman
  1 sibling, 1 reply; 178+ messages in thread
From: Roland Dreier @ 2007-01-25 22:05 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

 > Are you sure of the "any later version" part?

The command

    git grep "at your option) any later version"

finds more than 3000 matching files in my kernel source tree, so I
think existing practice shows it's fine if someone wants to license
a file for kernel inclusion that way.

 - R.

* Re: [RFC/PATCH 5/16] Ops based MSI implementation
  2007-01-25 22:05     ` Roland Dreier
@ 2007-01-25 22:10       ` Greg KH
  0 siblings, 0 replies; 178+ messages in thread
From: Greg KH @ 2007-01-25 22:10 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

On Thu, Jan 25, 2007 at 02:05:11PM -0800, Roland Dreier wrote:
>  > Are you sure of the "any later version" part?
> 
> The command
> 
>     git grep "at your option) any later version"
> 
> finds more than 3000 matching files in my kernel source tree, so I
> think existing practice shows it's fine if someone wants to license
> a file for kernel inclusion that way.

Oh, I'm not saying that it isn't acceptable, just pointing it out so
that the poster thinks about it, instead of just cut-and-pasting it from
somewhere else.  Most companies today have policies about this wording
that the developer probably needs to be aware of.

Especially given the state of the current GPLv3 license wording, but
that's another email thread for another time :)

thanks,

greg k-h

* patch msi-abstract-msi-suspend.patch added to gregkh-2.6 tree
  2007-01-25  8:34 ` [RFC/PATCH 4/16] Abstract MSI suspend Michael Ellerman
@ 2007-01-25 22:33   ` gregkh
  2007-01-28  8:27   ` [RFC/PATCH 4/16] Abstract MSI suspend Eric W. Biederman
  1 sibling, 0 replies; 178+ messages in thread
From: gregkh @ 2007-01-25 22:33 UTC (permalink / raw)
  To: michael, brice, davem, ebiederm, greg, gregkh, kyle,
	linuxppc-dev, shaohua.li


This is a note to let you know that I've just added the patch titled

     Subject: MSI: Abstract MSI suspend

to my gregkh-2.6 tree.  Its filename is

     msi-abstract-msi-suspend.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael@ozlabs.org  Thu Jan 25 14:14:57 2007
From: Michael Ellerman <michael@ellerman.id.au>
Date: Thu, 25 Jan 2007 19:34:09 +1100
Subject: MSI: Abstract MSI suspend
To: linux-pci@atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg@kroah.com>, Eric W. Biederman <ebiederm@xmission.com>, David S. Miller <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>
Message-ID: <20070125083410.631EEDE277@ozlabs.org>

Currently pci_disable_device() disables MSI on a device by twiddling
bits in config space via disable_msi_mode().

On some platforms that may not be appropriate, so abstract the MSI
suspend logic into pci_disable_device_msi().

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c |   11 +++++++++++
 drivers/pci/pci.c |    7 +------
 drivers/pci/pci.h |    2 ++
 3 files changed, 14 insertions(+), 6 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -272,6 +272,17 @@ void disable_msi_mode(struct pci_dev *de
 	pci_intx(dev, 1);  /* enable intx */
 }
 
+void pci_disable_device_msi(struct pci_dev *dev)
+{
+	if (dev->msi_enabled)
+		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
+			PCI_CAP_ID_MSI);
+
+	if (dev->msix_enabled)
+		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSIX),
+			PCI_CAP_ID_MSIX);
+}
+
 static int msi_lookup_irq(struct pci_dev *dev, int type)
 {
 	int irq;
--- gregkh-2.6.orig/drivers/pci/pci.c
+++ gregkh-2.6/drivers/pci/pci.c
@@ -772,12 +772,7 @@ pci_disable_device(struct pci_dev *dev)
 	if (atomic_sub_return(1, &dev->enable_cnt) != 0)
 		return;
 
-	if (dev->msi_enabled)
-		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
-			PCI_CAP_ID_MSI);
-	if (dev->msix_enabled)
-		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
-			PCI_CAP_ID_MSIX);
+	pci_disable_device_msi(dev);
 
 	pci_read_config_word(dev, PCI_COMMAND, &pci_command);
 	if (pci_command & PCI_COMMAND_MASTER) {
--- gregkh-2.6.orig/drivers/pci/pci.h
+++ gregkh-2.6/drivers/pci/pci.h
@@ -47,9 +47,11 @@ extern unsigned int pci_pm_d3_delay;
 
 #ifdef CONFIG_PCI_MSI
 void disable_msi_mode(struct pci_dev *dev, int pos, int type);
+extern void pci_disable_device_msi(struct pci_dev *dev);
 void pci_no_msi(void);
 #else
 static inline void disable_msi_mode(struct pci_dev *dev, int pos, int type) { }
+static inline void pci_disable_device_msi(struct pci_dev *dev) { }
 static inline void pci_no_msi(void) { }
 #endif
 


Patches currently in gregkh-2.6 which might be from michael@ellerman.id.au are

* patch msi-combine-pci__msi-msix_state.patch added to gregkh-2.6 tree
  2007-01-25  8:34 ` [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state Michael Ellerman
@ 2007-01-25 22:33   ` gregkh
  0 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-01-25 22:33 UTC (permalink / raw)
  To: michael, brice, davem, ebiederm, greg, gregkh, kyle,
	linuxppc-dev, shaohua.li


This is a note to let you know that I've just added the patch titled

     Subject: MSI: Combine pci_(save|restore)_msi/msix_state

to my gregkh-2.6 tree.  Its filename is

     msi-combine-pci__msi-msix_state.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael@ozlabs.org  Thu Jan 25 14:13:06 2007
From: Michael Ellerman <michael@ellerman.id.au>
Date: Thu, 25 Jan 2007 19:34:08 +1100
Subject: MSI: Combine pci_(save|restore)_msi/msix_state
To: linux-pci@atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg@kroah.com>, Eric W. Biederman <ebiederm@xmission.com>, David S. Miller <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>
Message-ID: <20070125083409.D4EBADE25B@ozlabs.org>


The PCI save/restore code doesn't need to care about MSI vs MSI-X, all
it really wants is to say "save/restore all MSI(-X) info for this device".

This is borne out in the code, we call the MSI and MSI-X save routines
side by side, and similarly with the restore routines.

So combine the MSI/MSI-X routines into pci_save_msi_state() and
pci_restore_msi_state(). It is up to those routines to decide what state
needs to be saved.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c |   27 +++++++++++++++++++++++----
 drivers/pci/pci.c |    4 +---
 drivers/pci/pci.h |    6 ++----
 3 files changed, 26 insertions(+), 11 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -295,7 +295,7 @@ static int msi_lookup_irq(struct pci_dev
 }
 
 #ifdef CONFIG_PM
-int pci_save_msi_state(struct pci_dev *dev)
+static int __pci_save_msi_state(struct pci_dev *dev)
 {
 	int pos, i = 0;
 	u16 control;
@@ -333,7 +333,7 @@ int pci_save_msi_state(struct pci_dev *d
 	return 0;
 }
 
-void pci_restore_msi_state(struct pci_dev *dev)
+static void __pci_restore_msi_state(struct pci_dev *dev)
 {
 	int i = 0, pos;
 	u16 control;
@@ -361,7 +361,7 @@ void pci_restore_msi_state(struct pci_de
 	kfree(save_state);
 }
 
-int pci_save_msix_state(struct pci_dev *dev)
+static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
 	int temp;
@@ -409,7 +409,20 @@ int pci_save_msix_state(struct pci_dev *
 	return 0;
 }
 
-void pci_restore_msix_state(struct pci_dev *dev)
+int pci_save_msi_state(struct pci_dev *dev)
+{
+	int rc;
+
+	rc = __pci_save_msi_state(dev);
+	if (rc)
+		return rc;
+
+	rc = __pci_save_msix_state(dev);
+
+	return rc;
+}
+
+static void __pci_restore_msix_state(struct pci_dev *dev)
 {
 	u16 save;
 	int pos;
@@ -446,6 +459,12 @@ void pci_restore_msix_state(struct pci_d
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 }
+
+void pci_restore_msi_state(struct pci_dev *dev)
+{
+	__pci_restore_msi_state(dev);
+	__pci_restore_msix_state(dev);
+}
 #endif	/* CONFIG_PM */
 
 /**
--- gregkh-2.6.orig/drivers/pci/pci.c
+++ gregkh-2.6/drivers/pci/pci.c
@@ -634,8 +634,6 @@ pci_save_state(struct pci_dev *dev)
 		pci_read_config_dword(dev, i * 4,&dev->saved_config_space[i]);
 	if ((i = pci_save_msi_state(dev)) != 0)
 		return i;
-	if ((i = pci_save_msix_state(dev)) != 0)
-		return i;
 	if ((i = pci_save_pcie_state(dev)) != 0)
 		return i;
 	if ((i = pci_save_pcix_state(dev)) != 0)
@@ -673,7 +671,7 @@ pci_restore_state(struct pci_dev *dev)
 	}
 	pci_restore_pcix_state(dev);
 	pci_restore_msi_state(dev);
-	pci_restore_msix_state(dev);
+
 	return 0;
 }
 
--- gregkh-2.6.orig/drivers/pci/pci.h
+++ gregkh-2.6/drivers/pci/pci.h
@@ -52,17 +52,15 @@ void pci_no_msi(void);
 static inline void disable_msi_mode(struct pci_dev *dev, int pos, int type) { }
 static inline void pci_no_msi(void) { }
 #endif
+
 #if defined(CONFIG_PCI_MSI) && defined(CONFIG_PM)
 int pci_save_msi_state(struct pci_dev *dev);
-int pci_save_msix_state(struct pci_dev *dev);
 void pci_restore_msi_state(struct pci_dev *dev);
-void pci_restore_msix_state(struct pci_dev *dev);
 #else
 static inline int pci_save_msi_state(struct pci_dev *dev) { return 0; }
-static inline int pci_save_msix_state(struct pci_dev *dev) { return 0; }
 static inline void pci_restore_msi_state(struct pci_dev *dev) {}
-static inline void pci_restore_msix_state(struct pci_dev *dev) {}
 #endif
+
 static inline int pci_no_d1d2(struct pci_dev *dev)
 {
 	unsigned int parent_dstates = 0;
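
The shape of the combined save/restore entry points can be shown in isolation
with the userspace sketch below. The point is the error handling: the combined
save stops at the first failure and returns that error, while the combined
restore always runs both halves. The struct and all function names are
hypothetical, and the fail flags exist only to simulate an error in one half.

```c
/* Hypothetical per-device state; the fail flags simulate a save error. */
struct dev_stub {
	int msi_fail, msix_fail;
	int msi_saved, msix_saved;
};

static int save_msi(struct dev_stub *d)
{
	if (d->msi_fail)
		return -1;
	d->msi_saved = 1;
	return 0;
}

static int save_msix(struct dev_stub *d)
{
	if (d->msix_fail)
		return -1;
	d->msix_saved = 1;
	return 0;
}

/* Single entry point: callers no longer care about MSI versus MSI-X. */
int save_msi_state(struct dev_stub *d)
{
	int rc;

	rc = save_msi(d);
	if (rc)
		return rc;	/* first error wins; MSI-X is not attempted */

	return save_msix(d);
}

static void restore_msi(struct dev_stub *d)
{
	d->msi_saved = 0;	/* saved state is consumed by the restore */
}

static void restore_msix(struct dev_stub *d)
{
	d->msix_saved = 0;
}

/* Restore cannot fail, so both halves always run. */
void restore_msi_state(struct dev_stub *d)
{
	restore_msi(d);
	restore_msix(d);
}
```

This matches what pci_save_state() did before the patch, where the MSI-X save
was only reached if the MSI save had returned zero.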


Patches currently in gregkh-2.6 which might be from michael@ellerman.id.au are

* patch msi-remove-pci_scan_msi_device.patch added to gregkh-2.6 tree
  2007-01-25  8:34 ` [RFC/PATCH 2/16] Remove pci_scan_msi_device() Michael Ellerman
@ 2007-01-25 22:33   ` gregkh
  0 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-01-25 22:33 UTC (permalink / raw)
  To: michael, brice, davem, ebiederm, greg, gregkh, kyle,
	linuxppc-dev, shaohua.li


This is a note to let you know that I've just added the patch titled

     Subject: MSI: Remove pci_scan_msi_device()

to my gregkh-2.6 tree.  Its filename is

     msi-remove-pci_scan_msi_device.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Thu Jan 25 14:12:48 2007
From: Michael Ellerman <michael@ellerman.id.au>
Date: Thu, 25 Jan 2007 19:34:08 +1100
Subject: MSI: Remove pci_scan_msi_device()
To: linux-pci@atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg@kroah.com>, Eric W. Biederman <ebiederm@xmission.com>, David S. Miller <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>
Message-ID: <20070125083409.5E9F0DE257@ozlabs.org>


pci_scan_msi_device() doesn't do anything anymore, so remove it.

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 arch/powerpc/kernel/pci_64.c |    2 --
 drivers/pci/msi.c            |    6 ------
 drivers/pci/probe.c          |    1 -
 include/linux/pci.h          |    2 --
 4 files changed, 11 deletions(-)

--- gregkh-2.6.orig/arch/powerpc/kernel/pci_64.c
+++ gregkh-2.6/arch/powerpc/kernel/pci_64.c
@@ -381,8 +381,6 @@ struct pci_dev *of_create_pci_dev(struct
 
 	pci_device_add(dev, bus);
 
-	/* XXX pci_scan_msi_device(dev); */
-
 	return dev;
 }
 EXPORT_SYMBOL(of_create_pci_dev);
--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -294,12 +294,6 @@ static int msi_lookup_irq(struct pci_dev
 	return -EACCES;
 }
 
-void pci_scan_msi_device(struct pci_dev *dev)
-{
-	if (!dev)
-		return;
-}
-
 #ifdef CONFIG_PM
 int pci_save_msi_state(struct pci_dev *dev)
 {
--- gregkh-2.6.orig/drivers/pci/probe.c
+++ gregkh-2.6/drivers/pci/probe.c
@@ -946,7 +946,6 @@ pci_scan_single_device(struct pci_bus *b
 		return NULL;
 
 	pci_device_add(dev, bus);
-	pci_scan_msi_device(dev);
 
 	return dev;
 }
--- gregkh-2.6.orig/include/linux/pci.h
+++ gregkh-2.6/include/linux/pci.h
@@ -626,7 +626,6 @@ struct msix_entry {
 
 
 #ifndef CONFIG_PCI_MSI
-static inline void pci_scan_msi_device(struct pci_dev *dev) {}
 static inline int pci_enable_msi(struct pci_dev *dev) {return -1;}
 static inline void pci_disable_msi(struct pci_dev *dev) {}
 static inline int pci_enable_msix(struct pci_dev* dev,
@@ -634,7 +633,6 @@ static inline int pci_enable_msix(struct
 static inline void pci_disable_msix(struct pci_dev *dev) {}
 static inline void msi_remove_pci_irq_vectors(struct pci_dev *dev) {}
 #else
-extern void pci_scan_msi_device(struct pci_dev *dev);
 extern int pci_enable_msi(struct pci_dev *dev);
 extern void pci_disable_msi(struct pci_dev *dev);
 extern int pci_enable_msix(struct pci_dev* dev,


Patches currently in gregkh-2.6 which might be from michael@ellerman.id.au are

^ permalink raw reply	[flat|nested] 178+ messages in thread

* patch msi-replace-pci_msi_quirk-with-calls-to-pci_no_msi.patch added to gregkh-2.6 tree
  2007-01-25  8:34 ` [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi() Michael Ellerman
@ 2007-01-25 22:33   ` gregkh
  0 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-01-25 22:33 UTC (permalink / raw)
  To: michael, brice, davem, ebiederm, greg, gregkh, kyle,
	linuxppc-dev, shaohua.li


This is a note to let you know that I've just added the patch titled

     Subject: MSI: Replace pci_msi_quirk with calls to pci_no_msi()

to my gregkh-2.6 tree.  Its filename is

     msi-replace-pci_msi_quirk-with-calls-to-pci_no_msi.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From michael@ozlabs.org  Thu Jan 25 14:12:24 2007
From: Michael Ellerman <michael@ellerman.id.au>
Date: Thu, 25 Jan 2007 19:34:07 +1100
Subject: MSI: Replace pci_msi_quirk with calls to pci_no_msi()
To: linux-pci@atrey.karlin.mff.cuni.cz
Cc: Greg Kroah-Hartman <greg@kroah.com>, Eric W. Biederman <ebiederm@xmission.com>, David S. Miller <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>
Message-ID: <20070125083408.DC5BEDE24D@ozlabs.org>


I don't see any reason why we need pci_msi_quirk, quirk code can just
call pci_no_msi() instead.

Remove the check of pci_msi_quirk in msi_init(). This is safe as all
calls to msi_init() are protected by calls to pci_msi_supported(),
which checks pci_msi_enable, which is disabled by pci_no_msi().

The pci_disable_msi routines didn't check pci_msi_quirk, only
pci_msi_enable, but as far as I can see that was a bug not a feature.
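
The argument above can be checked with a small userspace model. This is only
a sketch: pci_msi_enable and pci_no_msi() are named as in the patch, but the
*_model functions, their bodies and the -EINVAL return value are assumptions
made for illustration, not the kernel code.

```c
/* Userspace model only; the *_model names and bodies are hypothetical. */
#include <assert.h>
#include <errno.h>

static int pci_msi_enable = 1;	/* global MSI switch */
static int msi_init_calls;	/* counts how often the init path is reached */

/* What quirk code calls instead of setting pci_msi_quirk. */
static void pci_no_msi(void)
{
	pci_msi_enable = 0;
}

/* Stand-in for pci_msi_supported(): 0 when MSI may be used. */
static int pci_msi_supported_model(void)
{
	return pci_msi_enable ? 0 : -EINVAL;
}

/* Stand-in for msi_init(); it carries no quirk check of its own. */
static int msi_init_model(void)
{
	msi_init_calls++;
	return 0;
}

/* Every enable path checks pci_msi_supported() before msi_init(). */
static int pci_enable_msi_model(void)
{
	int status = pci_msi_supported_model();

	if (status < 0)
		return status;
	return msi_init_model();
}

void quirk_demo(void)
{
	int rc;

	pci_msi_enable = 1;
	msi_init_calls = 0;

	rc = pci_enable_msi_model();
	assert(rc == 0 && msi_init_calls == 1);

	pci_no_msi();			/* a quirk fires */
	rc = pci_enable_msi_model();
	assert(rc == -EINVAL);
	assert(msi_init_calls == 1);	/* msi_init() was never reached */
}
```

Once the quirk has called pci_no_msi(), the guard rejects the request before
msi_init() runs, which is why the separate pci_msi_quirk test adds nothing.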

Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/net/bnx2.c   |    3 +--
 drivers/pci/msi.c    |    7 -------
 drivers/pci/pci.h    |    6 +-----
 drivers/pci/quirks.c |    7 ++-----
 4 files changed, 4 insertions(+), 19 deletions(-)

--- gregkh-2.6.orig/drivers/net/bnx2.c
+++ gregkh-2.6/drivers/net/bnx2.c
@@ -5942,8 +5942,7 @@ bnx2_init_board(struct pci_dev *pdev, st
 	 * responding after a while.
 	 *
 	 * AMD believes this incompatibility is unique to the 5706, and
-	 * prefers to locally disable MSI rather than globally disabling it
-	 * using pci_msi_quirk.
+	 * prefers to locally disable MSI rather than globally disabling it.
 	 */
 	if (CHIP_NUM(bp) == CHIP_NUM_5706 && disable_msi == 0) {
 		struct pci_dev *amd_8132 = NULL;
--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -170,13 +170,6 @@ static int msi_init(void)
 	if (!status)
 		return status;
 
-	if (pci_msi_quirk) {
-		pci_msi_enable = 0;
-		printk(KERN_WARNING "PCI: MSI quirk detected. MSI disabled.\n");
-		status = -EINVAL;
-		return status;
-	}
-
 	status = msi_cache_init();
 	if (status < 0) {
 		pci_msi_enable = 0;
--- gregkh-2.6.orig/drivers/pci/pci.h
+++ gregkh-2.6/drivers/pci/pci.h
@@ -43,12 +43,8 @@ extern void pci_remove_legacy_files(stru
 /* Lock for read/write access to pci device and bus lists */
 extern struct rw_semaphore pci_bus_sem;
 
-#ifdef CONFIG_PCI_MSI
-extern int pci_msi_quirk;
-#else
-#define pci_msi_quirk 0
-#endif
 extern unsigned int pci_pm_d3_delay;
+
 #ifdef CONFIG_PCI_MSI
 void disable_msi_mode(struct pci_dev *dev, int pos, int type);
 void pci_no_msi(void);
--- gregkh-2.6.orig/drivers/pci/quirks.c
+++ gregkh-2.6/drivers/pci/quirks.c
@@ -1692,9 +1692,6 @@ DECLARE_PCI_FIXUP_RESUME(PCI_VENDOR_ID_N
 			quirk_nvidia_ck804_pcie_aer_ext_cap);
 
 #ifdef CONFIG_PCI_MSI
-/* To disable MSI globally */
-int pci_msi_quirk;
-
 /* The Serverworks PCI-X chipset does not support MSI. We cannot easily rely
  * on setting PCI_BUS_FLAGS_NO_MSI in its bus flags because there are actually
  * some other busses controlled by the chipset even if Linux is not aware of it.
@@ -1703,8 +1700,8 @@ int pci_msi_quirk;
  */
 static void __init quirk_svw_msi(struct pci_dev *dev)
 {
-	pci_msi_quirk = 1;
-	printk(KERN_WARNING "PCI: MSI quirk detected. pci_msi_quirk set.\n");
+	pci_no_msi();
+	printk(KERN_WARNING "PCI: MSI quirk detected. MSI deactivated.\n");
 }
 DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_SERVERWORKS, PCI_DEVICE_ID_SERVERWORKS_GCNB_LE, quirk_svw_msi);
 


Patches currently in gregkh-2.6 which might be from michael@ellerman.id.au are

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 5/16] Ops based MSI implementation
  2007-01-25 21:52   ` Greg KH
  2007-01-25 22:05     ` Roland Dreier
@ 2007-01-26  1:02     ` Michael Ellerman
  1 sibling, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-26  1:02 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 3814 bytes --]

On Thu, 2007-01-25 at 13:52 -0800, Greg KH wrote:
> On Thu, Jan 25, 2007 at 07:34:09PM +1100, Michael Ellerman wrote:
> > --- /dev/null
> > +++ msi/include/linux/msi-ops.h
> > @@ -0,0 +1,168 @@
> > +/*
> > + * Copyright 2006-2007, Michael Ellerman, IBM Corporation.
> > + *
> > + * This program is free software; you can redistribute it and/or
> > + * modify it under the terms of the GNU General Public License
> > + * as published by the Free Software Foundation; either version
> > + * 2 of the License, or (at your option) any later version.
> 
> Are you sure of the "any later version" part?

Not 100%. I copied it from an existing file in arch/powerpc, and I
haven't heard anything about changing the boiler plate - but I'll see if
anyone knows. I'm not really interested in starting a GPLv2 vs GPLv3
debate :)

> > + */
> > +
> > +#ifndef LINUX_MSI_OPS_H
> > +#define LINUX_MSI_OPS_H
> > +
> > +#ifdef __KERNEL__
> > +#ifndef __ASSEMBLY__
> 
> These two ifdefs aren't needed.

OK. I thought __KERNEL__ was for hiding things from userspace, but I
haven't been following the header developments. 

> > +#include <linux/pci.h>
> > +#include <linux/msi.h>
> 
> Why not just put this structure in the msi.h file?

Yeah OK I'll put it in there.

> > +
> > +/*
> > + * MSI and MSI-X although different in some details, are also similar in
> > + * many respects, and ultimately achieve the same end. Given that, this code
> > + * tries as far as possible to implement both MSI and MSI-X with a minimum
> > + * of code duplication. We will use "MSI" to refer to both MSI and MSI-X,
> > + * except where it is important to differentiate between the two.
> > + *
> > + * Enabling MSI for a device can be broken down into:
> > + *  1) Checking the device can support the type/number of MSIs requested.
> > + *  2) Allocating irqs for the MSIs and setting up the irq_descs.
> > + *  3) Writing the appropriate configuration to the device and enabling MSIs.
> > + *
> > + * To implement that we have the following callbacks:
> > + *  1) check(pdev, num, msix_entries, type)
> > + *  2) alloc(pdev, num, msix_entries, type)
> > + *  3) enable(pdev, num, msix_entries, type)
> > + *	a) setup_msi_msg(pdev, msix_entry, msi_msg, type)
> > + *
> > + * We give platforms full control over the enable step. However many
> > + * platforms will simply want to program the device using standard PCI
> > + * accessors. These platforms can use a generic enable callback and define
> > + * a setup_msi_msg() callback which simply fills in the "magic" address and
> > + * data values. Other platforms may leave setup_msi_msg() empty.
> > + *
> > + * Disabling MSI requires:
> > + *  1) Disabling MSI on the device.
> > + *  2) Freeing the irqs and any associated accounting information.
> > + *
> > + * Which maps directly to the two callbacks:
> > + *  1) disable(pdev, num, msix_entries, type)
> > + *  2) free(pdev, num, msix_entries, type)
> > + */
> 
> 
> Please use the proper kernel-doc format so the tools pick up this
> documentation automatically.

Will do.
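
For readers following along, the callback flow described in the quoted
comment can be sketched roughly as below. This is a userspace illustration
under stated assumptions: the callback names and argument order come from
the comment, while struct msi_ops_sketch, the placeholder pci_dev and
msix_entry types, the toy backend and the unwind-on-failure behaviour are
all hypothetical.

```c
/* Userspace sketch; types and the toy backend are placeholders. */
#include <assert.h>
#include <string.h>

struct pci_dev { int msi_enabled; };		/* placeholder */
struct msix_entry { unsigned int vector; };	/* placeholder */

struct msi_ops_sketch {
	int  (*check)(struct pci_dev *pdev, int num, struct msix_entry *e, int type);
	int  (*alloc)(struct pci_dev *pdev, int num, struct msix_entry *e, int type);
	int  (*enable)(struct pci_dev *pdev, int num, struct msix_entry *e, int type);
	void (*disable)(struct pci_dev *pdev, int num, struct msix_entry *e, int type);
	void (*free)(struct pci_dev *pdev, int num, struct msix_entry *e, int type);
};

/* Enable: check, then alloc, then enable; undo the alloc if enable fails. */
static int msi_enable_sketch(const struct msi_ops_sketch *ops, struct pci_dev *pdev,
			     int num, struct msix_entry *e, int type)
{
	int rc = ops->check(pdev, num, e, type);

	if (rc)
		return rc;
	rc = ops->alloc(pdev, num, e, type);
	if (rc)
		return rc;
	rc = ops->enable(pdev, num, e, type);
	if (rc)
		ops->free(pdev, num, e, type);
	return rc;
}

/* Disable: quiesce the device first, then release the irqs. */
static void msi_disable_sketch(const struct msi_ops_sketch *ops, struct pci_dev *pdev,
			       int num, struct msix_entry *e, int type)
{
	ops->disable(pdev, num, e, type);
	ops->free(pdev, num, e, type);
}

/* Toy backend that records the order in which callbacks run. */
static char trace[16];

static void note(char c)
{
	size_t n = strlen(trace);

	if (n + 1 < sizeof(trace)) {
		trace[n] = c;
		trace[n + 1] = '\0';
	}
}

static int toy_check(struct pci_dev *pdev, int num, struct msix_entry *e, int type)
{
	(void)pdev; (void)e; (void)type;
	note('c');
	return num == 1 ? 0 : -1;	/* this toy backend supports one MSI only */
}

static int toy_alloc(struct pci_dev *pdev, int num, struct msix_entry *e, int type)
{
	(void)pdev; (void)num; (void)type;
	note('a');
	e[0].vector = 42;
	return 0;
}

static int toy_enable(struct pci_dev *pdev, int num, struct msix_entry *e, int type)
{
	(void)num; (void)e; (void)type;
	note('e');
	pdev->msi_enabled = 1;
	return 0;
}

static void toy_disable(struct pci_dev *pdev, int num, struct msix_entry *e, int type)
{
	(void)num; (void)e; (void)type;
	note('d');
	pdev->msi_enabled = 0;
}

static void toy_free(struct pci_dev *pdev, int num, struct msix_entry *e, int type)
{
	(void)pdev; (void)num; (void)type;
	note('f');
	e[0].vector = 0;
}

void msi_ops_demo(void)
{
	const struct msi_ops_sketch ops = {
		toy_check, toy_alloc, toy_enable, toy_disable, toy_free
	};
	struct pci_dev pdev = { 0 };
	struct msix_entry entry = { 0 };
	int rc;

	trace[0] = '\0';
	rc = msi_enable_sketch(&ops, &pdev, 2, &entry, 0);
	assert(rc == -1 && strcmp(trace, "c") == 0);	/* rejected at check */

	trace[0] = '\0';
	rc = msi_enable_sketch(&ops, &pdev, 1, &entry, 0);
	assert(rc == 0 && strcmp(trace, "cae") == 0);
	assert(pdev.msi_enabled == 1 && entry.vector == 42);

	msi_disable_sketch(&ops, &pdev, 1, &entry, 0);
	assert(strcmp(trace, "caedf") == 0 && pdev.msi_enabled == 0);
}
```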

> > +#define msi_debug(fmt, args...)	\
> > +	pr_debug("MSI:%s:%d: " fmt, __FUNCTION__, __LINE__, ## args)
> 
> Please use dev_dbg(), it makes it easier to track which device is being
> referenced.

OK. My only gripe with dev_dbg() is it doesn't handle a NULL dev, which
means you have to be very careful where you use it. There's at least one
spot in the MSI code where I call msi_debug() with a possibly NULL pdev.
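
One possible way around the NULL problem is a thin wrapper that picks a
fallback name before formatting. The following is a userspace sketch only:
struct device here is a placeholder, and msi_dev_dbg(), msi_dbg_impl() and
msi_last_msg are hypothetical names, not existing kernel interfaces.

```c
/* Userspace sketch; none of these names exist in the kernel. */
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

struct device { const char *name; };	/* placeholder for the kernel type */

static char msi_last_msg[128];		/* kept so the output can be inspected */

static void msi_dbg_impl(const struct device *dev, const char *func,
			 const char *fmt, ...)
{
	va_list ap;
	int n;

	/* Fall back to a fixed label when there is no device to name. */
	n = snprintf(msi_last_msg, sizeof(msi_last_msg), "MSI %s: %s: ",
		     dev ? dev->name : "(no device)", func);
	if (n < 0 || (size_t)n >= sizeof(msi_last_msg))
		return;

	va_start(ap, fmt);
	vsnprintf(msi_last_msg + n, sizeof(msi_last_msg) - (size_t)n, fmt, ap);
	va_end(ap);

	fputs(msi_last_msg, stderr);
}

#define msi_dev_dbg(dev, ...) msi_dbg_impl((dev), __func__, __VA_ARGS__)

void msi_dbg_demo(void)
{
	struct device nic = { "0000:01:00.0" };

	msi_dev_dbg(&nic, "allocated %d vectors\n", 4);
	assert(strcmp(msi_last_msg,
		      "MSI 0000:01:00.0: msi_dbg_demo: allocated 4 vectors\n") == 0);

	msi_dev_dbg(NULL, "no pdev yet\n");	/* must not crash */
	assert(strcmp(msi_last_msg,
		      "MSI (no device): msi_dbg_demo: no pdev yet\n") == 0);
}
```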

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25 21:53 ` [RFC/PATCH 0/16] Ops based MSI Implementation Greg KH
  2007-01-25 21:55   ` David Miller
@ 2007-01-26  1:03   ` Michael Ellerman
  1 sibling, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-26  1:03 UTC (permalink / raw)
  To: Greg KH
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 1556 bytes --]

On Thu, 2007-01-25 at 13:53 -0800, Greg KH wrote:
> On Thu, Jan 25, 2007 at 07:34:07PM +1100, Michael Ellerman wrote:
> > OK, here's a first cut at moving ops based MSI into the generic code. I'm
> > posting this now to make sure I'm not heading off into the weeds.
> > 
> > The fifth patch contain the guts of it, I've included the MPIC and
> > RTAS backends as examples. In fact they actually work.
> > 
> > In order to smoothly merge this with the old MSI code, the two will need to
> > coexist in the tree for at least a few commits, so I've added (invisible)
> > Kconfig symbols to allow that.
> > 
> > I plan to merge the Intel code by:
> >  * copying it into drivers/pci/msi/intel.c with zero changes.
> >  * providing a minimal shim to connect the ops code to the intel code.
> >  * at this point the code should be functional but ugly as hell.
> >  * via a longish series of patches, adapt the intel code to better match
> >    the new ops code.
> >  * this should allow us to bisect through to find any mistakes.
> > 
> > If people think that's crazy and or stupid please let me know :)
> 
> At first glance, this looks sane.  I'll apply the first 4 patches to my
> trees, and hold off on the rest until you have the intel patches
> finished.

Cool, thanks.

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25 21:55   ` David Miller
@ 2007-01-26  1:05     ` Michael Ellerman
  0 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-26  1:05 UTC (permalink / raw)
  To: David Miller
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

[-- Attachment #1: Type: text/plain, Size: 1870 bytes --]

On Thu, 2007-01-25 at 13:55 -0800, David Miller wrote:
> From: Greg KH <greg@kroah.com>
> Date: Thu, 25 Jan 2007 13:53:07 -0800
> 
> > On Thu, Jan 25, 2007 at 07:34:07PM +1100, Michael Ellerman wrote:
> > > OK, here's a first cut at moving ops based MSI into the generic code. I'm
> > > posting this now to make sure I'm not heading off into the weeds.
> > > 
> > > The fifth patch contain the guts of it, I've included the MPIC and
> > > RTAS backends as examples. In fact they actually work.
> > > 
> > > In order to smoothly merge this with the old MSI code, the two will need to
> > > coexist in the tree for at least a few commits, so I've added (invisible)
> > > Kconfig symbols to allow that.
> > > 
> > > I plan to merge the Intel code by:
> > >  * copying it into drivers/pci/msi/intel.c with zero changes.
> > >  * providing a minimal shim to connect the ops code to the intel code.
> > >  * at this point the code should be functional but ugly as hell.
> > >  * via a longish series of patches, adapt the intel code to better match
> > >    the new ops code.
> > >  * this should allow us to bisect through to find any mistakes.
> > > 
> > > If people think that's crazy and or stupid please let me know :)
> > 
> > At first glance, this looks sane.  I'll apply the first 4 patches to my
> > trees, and hold off on the rest until you have the intel patches
> > finished.
> 
> I'll also look into a sparc64 implementation as soon as I find the
> time.

That'd be great. The more backends we have the more likely we are to
find the bugs and bogosities in my design.

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines
  2007-01-25  8:34 ` [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines Michael Ellerman
@ 2007-01-26  5:35   ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26  5:35 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> Add bare metal MSI enable & disable routines.

What is the plan for MSI-X support?  That is the interesting case in
hardware to support.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (16 preceding siblings ...)
  2007-01-25 21:53 ` [RFC/PATCH 0/16] Ops based MSI Implementation Greg KH
@ 2007-01-26  6:18 ` Eric W. Biederman
  2007-01-26  6:56   ` Grant Grundler
                     ` (2 more replies)
  2007-01-27  4:59 ` Michael Ellerman
  2007-01-28 19:40   ` Eric W. Biederman
  19 siblings, 3 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26  6:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> OK, here's a first cut at moving ops based MSI into the generic code. I'm
> posting this now to make sure I'm not heading off into the weeds.

First thanks for copying me on this.  I really appreciate it.

> The fifth patch contain the guts of it, I've included the MPIC and
> RTAS backends as examples. In fact they actually work.
>
> In order to smoothly merge this with the old MSI code, the two will need to
> coexist in the tree for at least a few commits, so I've added (invisible)
> Kconfig symbols to allow that.
>
> I plan to merge the Intel code by:
>  * copying it into drivers/pci/msi/intel.c with zero changes.
>  * providing a minimal shim to connect the ops code to the intel code.
>  * at this point the code should be functional but ugly as hell.
>  * via a longish series of patches, adapt the intel code to better match
>    the new ops code.
>  * this should allow us to bisect through to find any mistakes.
>
> If people think that's crazy and or stupid please let me know :)
>
> TBD are:
>  * suspend / resume hooks in the ops - this shouldn't be too tricky with
>    the power management API cleaned up a touch.
>  * working out why the hell msi_remove_pci_irq_vectors() is a special case ?


I haven't done more than skim the patches yet but I am distressed.

Your code appears to be nice, simple and clean, and to not support MSI in
a useful way.  I may be reading too quickly but at the moment your infrastructure
appears useless if you are on a platform that doesn't enforce MSI's get filtered
with a legacy interrupt controller.

You don't have MSI-X support (which is the interesting case) and you don't have
suspend/resume support.

You don't support the MSI mask bit.

Looking at your msi_ops it does not map to what I am doing on x86.  There
is the implicit assumption that the msi_message is fixed for the lifetime
of the msi.  Which is wrong.

So in short summary I cannot use your msi_ops they are inappropriate for
i386, x86_64 and ia64.

So at the moment I am opposed to this code because as it sits it appears to
be a serious regression.

The additional bits that feel like this code was primarily targeted at supporting
the RTAS with real hardware support thrown in as an afterthought just seem
to add insult to injury.  To date I have no information that indicates to me
that the RTAS model is at all sane or makes any sense to duplicate elsewhere.
If supporting the RTAS is what is obscuring your vision of what is really
needed to support MSI I don't want to see RTAS support in a patch set
until we get a good multiple platform architecture, merged into the kernel.

Supporting the RTAS first and breaking everyone who actually has real
hardware seems like very much the wrong approach to get a good
multiple platform solution.

After I get some sleep I will see if I can come up with some constructive
criticism on how we can make things work.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-25  8:34 ` [RFC/PATCH 14/16] MPIC MSI backend Michael Ellerman
@ 2007-01-26  6:43   ` Grant Grundler
  2007-01-26  7:02     ` Eric W. Biederman
  2007-01-26 20:41     ` Benjamin Herrenschmidt
  2007-01-26  9:11   ` Segher Boessenkool
  1 sibling, 2 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-26  6:43 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S. Miller,
	Brice Goglin

On Thu, Jan 25, 2007 at 07:34:16PM +1100, Michael Ellerman wrote:
> MPIC MSI backend. Based on code from Segher, heavily hacked by me.
> Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.
...
> +		/* FIXME should we save the existing type */
> +		set_irq_type(virq, IRQ_TYPE_EDGE_RISING);

What exactly does the "virq" represent here?
I'd like to understand if the FIXME comment could be dropped (or not).

I don't get the impression it's related to a PCI IRQ line.
Maybe irq_create_mapping() has comments that describe hwirq and virq?
If not, it would be useful if those terms were described.

thanks
grant

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:18 ` Eric W. Biederman
@ 2007-01-26  6:56   ` Grant Grundler
  2007-01-26  7:15     ` Eric W. Biederman
                       ` (2 more replies)
  2007-01-26 21:24   ` Benjamin Herrenschmidt
  2007-01-27  5:41   ` Michael Ellerman
  2 siblings, 3 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-26  6:56 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

On Thu, Jan 25, 2007 at 11:18:20PM -0700, Eric W. Biederman wrote:
> You code appears to be nice simple clean and to not support MSI in
> a useful way.  I may be reading too quickly but at the moment your
> infrastructure appears useless if you are on a platform that doesn't
> enforce MSI's get filtered with a legacy interrupt controller.

Hrm?
Isn't the point of MSI to avoid any sort of interrupt controller?

> You don't have MSI-X support (which is the interesting case) and you
> don't have suspend/resume support.

I saw save/restore entry points.
I expected suspend/resume code would use those.
Do you agree (or not)?

> You don't support the MSI mask bit.
> 
> Looking at your msi_ops it does not map to what I am doing on x86.  There
> is the implicit assumption that the msi_message is fixed for the lifetime
> of the msi.  Which is wrong.

Erm...wouldn't changing the message also effectively change which handler
ends up catching the interrupt?
I always understood the addr/msg were a pair that HW would map to a handler.
Can you explain what you mean by "lifetime" and "fixed"?
What event would change the message? system Suspend/resume?

...
> After I get some sleep I will see if I can up with some constructive
> criticism on how we can make things work.

Well, I hope the questions I pose above help lead the discussion in
that direction.

thanks,
grant

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  6:43   ` Grant Grundler
@ 2007-01-26  7:02     ` Eric W. Biederman
  2007-01-26  8:47       ` Segher Boessenkool
                         ` (2 more replies)
  2007-01-26 20:41     ` Benjamin Herrenschmidt
  1 sibling, 3 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26  7:02 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Grant Grundler <grundler@parisc-linux.org> writes:

> On Thu, Jan 25, 2007 at 07:34:16PM +1100, Michael Ellerman wrote:
>> MPIC MSI backend. Based on code from Segher, heavily hacked by me.
>> Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.
> ...
>> +		/* FIXME should we save the existing type */
>> +		set_irq_type(virq, IRQ_TYPE_EDGE_RISING);
>
> What exactly does the "virq" represent here?
> I'd like to understand if the FIXME comment could be dropped (or not).
>
> I don't get the impression it's related to a PCI IRQ line.
> Maybe irq_create_mapping() has comments that describe hwirq and virq?
> If not, it would be useful if those terms were described.


I don't have a clue why it is called virq.  But looking at the
usage it must be a linux irq number as shown in /proc/interrupts and
as such there need be no connection with hardware.

I believe the ppc model is to allocate an interrupt source on their
existing interrupt controller and use that instead of the normal x86
case of having the MSI interrupt go transparently to the cpu.

Both set_irq_type, and entries.vector take a linux irq number.
Darn we should change that field name, it is misleading.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:56   ` Grant Grundler
@ 2007-01-26  7:15     ` Eric W. Biederman
  2007-01-26  7:48       ` Grant Grundler
  2007-01-26  8:57     ` Segher Boessenkool
  2007-01-26 20:57     ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26  7:15 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Grant Grundler <grundler@parisc-linux.org> writes:

> On Thu, Jan 25, 2007 at 11:18:20PM -0700, Eric W. Biederman wrote:
>> You code appears to be nice simple clean and to not support MSI in
>> a useful way.  I may be reading too quickly but at the moment your
>> infrastructure appears useless if you are on a platform that doesn't
>> enforce MSI's get filtered with a legacy interrupt controller.
>
> Hrm?
> Isn't the point of MSI to avoid any sort of interrupt controller?

That is how the supported platforms were designed.  But something needs
to translate a pci message to a cpu interrupt and on PPC apparently
they implemented this in their interrupt controller.

To be fair there are also ioapics on x86 which can do this as well;
they just haven't been sufficiently interesting to support.

The interesting case will be when there is the equivalent of an
iommu for msi interrupts (and quite possibly it will be the iommu)
that filters iommu for hardware isolation purposes.

>> You don't have MSI-X support (which is the interesting case) and you
>> don't have suspend/resume support.
>
> I saw save/restore entry points.
> I expected suspend/resume code would use those.
> Do you agree (or not)?

Mostly for that bit I was relying on the documented part that said
they don't work yet.

>> You don't support the MSI mask bit.
>> 
>> Looking at your msi_ops it does not map to what I am doing on x86.  There
>> is the implicit assumption that the msi_message is fixed for the lifetime
>> of the msi.  Which is wrong.
>
> Erm...wouldn't changing the message also effectively change which handler
> ends up catching the interrupt?
> I always understood the addr/msg were a pair that HW would map to a handler.
> Can you explain what you mean by "lifetime" and "fixed"?
> What event would change the message? system Suspend/resume?

Suspend/resume and irq migration.  Currently the architecture code pushes
what it thinks best at the controller, instead of the pull model in
Michael Ellerman's patch.
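
To make the "push" idea concrete, here is a rough userspace sketch in which
the architecture side rewrites the device's MSI message each time the
interrupt is retargeted. Everything in it is an assumption for illustration:
the struct and function names are hypothetical, and the address/data
encoding is invented for the example rather than taken from any code in
this thread.

```c
/* Userspace sketch; names and the message encoding are made up. */
#include <assert.h>
#include <stdint.h>

struct msi_msg_sketch {
	uint32_t address_lo;
	uint32_t data;
};

/* Placeholder for a device's MSI address/data registers. */
struct toy_msi_dev {
	struct msi_msg_sketch regs;
	unsigned int writes;
};

/* Illustrative encoding: destination CPU in the address, vector in the data. */
static struct msi_msg_sketch compose_msg(unsigned int cpu, unsigned int vector)
{
	struct msi_msg_sketch msg;

	msg.address_lo = 0xfee00000u | ((uint32_t)cpu << 12);
	msg.data = vector;
	return msg;
}

static void write_msg(struct toy_msi_dev *dev, struct msi_msg_sketch msg)
{
	dev->regs = msg;
	dev->writes++;
}

/* Push model: the message is rewritten whenever the target changes. */
static void set_affinity_sketch(struct toy_msi_dev *dev, unsigned int cpu,
				unsigned int vector)
{
	write_msg(dev, compose_msg(cpu, vector));
}

void msi_push_demo(void)
{
	struct toy_msi_dev dev = { { 0, 0 }, 0 };

	set_affinity_sketch(&dev, 0, 0x41);	/* initial setup */
	assert(dev.regs.address_lo == 0xfee00000u);
	assert(dev.regs.data == 0x41);

	set_affinity_sketch(&dev, 2, 0x59);	/* irq migrated to another cpu */
	assert(dev.regs.address_lo == 0xfee02000u);
	assert(dev.regs.data == 0x59);
	assert(dev.writes == 2);		/* the message was not fixed */
}
```

A callback that is asked for the message only once, at enable time, has no
natural place for the second write; that is the mismatch being described.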

> ...
>> After I get some sleep I will see if I can up with some constructive
>> criticism on how we can make things work.
>
> Well, I hope the questions I pose above help lead the discussion in
> that direction.

We will see.  My current observation is that the problems that are
currently solved and the problems that Michael needed to solve to support
ppc are almost disjoint, which makes it a challenge to understand what the
other architecture is doing.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  7:15     ` Eric W. Biederman
@ 2007-01-26  7:48       ` Grant Grundler
  2007-01-26 15:26         ` Eric W. Biederman
  2007-01-26 21:58         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-26  7:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

On Fri, Jan 26, 2007 at 12:15:40AM -0700, Eric W. Biederman wrote:
> > Isn't the point of MSI to avoid any sort of interrupt controller?
> 
> That is how the supported platforms were designed.  But something needs
> to translate a pci message to a cpu interrupt and on PPC apparently
> they implemented this in their interrupt controller.
> 
> To be fair there are also ioapic on x86 which can do this as well
> they just haven't been sufficiently interesting to support.

Sorry, I'm not understanding your point...it's past my bedtime now
and maybe tomorrow it will make more sense.

I get the impression you are confusing Local-APIC with IO-APIC.
The former catches MSI's on behalf of the CPU and
the latter generates the equivalent of MSI's for IRQ lines.
Any CPU that can support MSI's has either a Local-APIC or its equivalent.
(e.g. parisc or alpha)

> 
> The interesting case will be when there is the equivalent of an
> iommu for msi interrupts (and quite possibly it will be the iommu)
> that filters iommu for hardware isolation purposes.

IOMMU might give us better control over where devices can write as long
as it's always used and not bypassed (e.g. ZX1 chipset).
Also, not all IOMMUs can direct MSI transactions to a CPU.
I thought some IOMMU implementations can only deal with cacheline
sized transactions and I never had the impression MSI filled out
a full cacheline. Anyway, I expect the "Virtual Machine Monitor"
will own the IOMMU and expect that's the case in the IBM machines (RTAS
boxen).

> >> You don't have MSI-X support (which is the interesting case) and you
> >> don't have suspend/resume support.
> >
> > I saw save/restore entry points.
> > I expected suspend/resume code would use those.
> > Do you agree (or not)?
> 
> Mostly for that bit I was relying on the documented part that said
> they don't work yet.

ok.

> 
> >> You don't support the MSI mask bit.
> >> 
> >> Looking at your msi_ops it does not map to what I am doing on x86.  There
> >> is the implicit assumption that the msi_message is fixed for the lifetime
> >> of the msi.  Which is wrong.
> >
> > Erm...wouldn't changing the message also effectively change which handler
> > ends up catching the interrupt?
> > I always understood the addr/msg were a pair that HW would map to a handler.
> > Can you explain what you mean by "lifetime" and "fixed"?
> > What event would change the message? system Suspend/resume?
> 
> Suspend/resume and irq migration.

Hrm ok. IRQ migration shouldn't surprise anyone.
I expect the "virq" (linux IRQ #) would hide the values changing
in a Suspend/resume event. If the code isn't doing that for platforms
that support suspend/resume, then I agree it's broken.

> Currently the architecture code pushes
> what it thinks best at the controller, instead of the pull model in
> Micheal Ellerman's patch.

I need to look at the code more to get this. thanks for the observation.

thanks,
grant

> 
> > ...
> >> After I get some sleep I will see if I can up with some constructive
> >> criticism on how we can make things work.
> >
> > Well, I hope the questions I pose above help lead the discussion in
> > that direction.
> 
> We will see.  My current observation seems to be that problems that are
> currently solved and the problems that Michael needed to solve to support
> ppc are almost disjoint.  Making it a challenge to understand what the
> other architecture is doing.
> 
> Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  7:02     ` Eric W. Biederman
@ 2007-01-26  8:47       ` Segher Boessenkool
  2007-01-26 16:32         ` Eric W. Biederman
  2007-01-26 20:50       ` Benjamin Herrenschmidt
  2007-01-26 22:46       ` Paul Mackerras
  2 siblings, 1 reply; 178+ messages in thread
From: Segher Boessenkool @ 2007-01-26  8:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

>>> MPIC MSI backend. Based on code from Segher, heavily hacked by me.
>>> Renamed to mpic_htmsi, as it only deals with MSI over  
>>> Hypertransport.
>> ...
>>> +		/* FIXME should we save the existing type */
>>> +		set_irq_type(virq, IRQ_TYPE_EDGE_RISING);
>>
>> What exactly does the "virq" represent here?
>> I'd like to understand if the FIXME comment could be dropped (or  
>> not).

Now that we don't reuse existing PCI vectors on MPIC (or
do you still do that, Michael?) this FIXME can go.

>> I don't get the impression it's related to a PCI IRQ line.
>> Maybe irq_create_mapping() has comments that describe hwirq and virq?
>> If not, it would be useful if those terms were described.

The code you comment on lives in arch/powerpc/.  virq and
hwirq are used in there all over the place.  Have a look.

> I don't have a clue why it is called virq.  But looking at the
> usage it must be a linux irq number as shown in /proc/interrupts and
> as such there need be no connection with hardware.

Well of course it's connected to real hardware.  The virq
numbers are a flat space; hwirqs are not (those are relative
to one certain interrupt controller) so virqs are easier
to use.
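
The flat-versus-relative distinction can be illustrated with a toy mapping
table. This is a userspace sketch under assumptions: create_mapping_sketch(),
struct irq_host_sketch, the table size and the use of 0 as "no interrupt"
are hypothetical and are not the arch/powerpc implementation.

```c
/* Userspace sketch; not the arch/powerpc code. */
#include <assert.h>
#include <stddef.h>

#define NR_VIRQS 8
#define NO_VIRQ  0	/* virq 0 means "no interrupt" in this sketch */

struct irq_host_sketch { const char *name; };	/* one interrupt controller */

struct virq_entry {
	const struct irq_host_sketch *host;
	unsigned int hwirq;	/* only meaningful relative to host */
};

static struct virq_entry virq_map[NR_VIRQS];

/* Return the flat virq for (host, hwirq), allocating one if needed. */
static unsigned int create_mapping_sketch(const struct irq_host_sketch *host,
					  unsigned int hwirq)
{
	unsigned int virq;

	/* Reuse an existing mapping for the same controller and source. */
	for (virq = 1; virq < NR_VIRQS; virq++)
		if (virq_map[virq].host == host && virq_map[virq].hwirq == hwirq)
			return virq;

	/* Otherwise take the first free slot. */
	for (virq = 1; virq < NR_VIRQS; virq++) {
		if (virq_map[virq].host == NULL) {
			virq_map[virq].host = host;
			virq_map[virq].hwirq = hwirq;
			return virq;
		}
	}
	return NO_VIRQ;
}

void virq_demo(void)
{
	static const struct irq_host_sketch mpic = { "mpic" };
	static const struct irq_host_sketch other = { "cascaded" };
	unsigned int a = create_mapping_sketch(&mpic, 5);
	unsigned int b = create_mapping_sketch(&other, 5);

	assert(a != NO_VIRQ && b != NO_VIRQ);
	assert(a != b);		/* same hwirq, different controllers */
	assert(create_mapping_sketch(&mpic, 5) == a);
	assert(virq_map[b].host == &other && virq_map[b].hwirq == 5);
}
```

Two controllers can both have a source numbered 5, but each ends up with its
own virq, so callers can pass a single number around.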

> I believe the ppc model is to allocate an interrupt source on their
> existing interrupt controller and use that instead of the normal x86
> case of having the MSI interrupt go transparently to the cpu.

That's not the "PowerPC model".  On PowerPC, there is
really only one external interrupt to the CPU.  This
is usually connected to a "master interrupt controller",
in this case, the MPIC on the U3/U4 system controller.

This specific controller implements MSIs (just like
HT interrupts really) by mapping HT writes to certain
addresses to an IRQ input on the MPIC.  The only thing
this code does is set the sense/polarity for this IRQ
input.

You really don't need to know any of this if what you
care about is x86.  I really wonder how you can call
the x86 case "normal" or what you mean by "transparently"
btw ;-)


Segher

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:56   ` Grant Grundler
  2007-01-26  7:15     ` Eric W. Biederman
@ 2007-01-26  8:57     ` Segher Boessenkool
  2007-01-26 17:27       ` Grant Grundler
  2007-01-26 20:57     ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 178+ messages in thread
From: Segher Boessenkool @ 2007-01-26  8:57 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller, Eric W. Biederman

>> Your code appears to be nice simple clean and to not support MSI in
>> a useful way.  I may be reading too quickly but at the moment your
>> infrastructure appears useless if you are on a platform that doesn't
>> enforce MSI's get filtered with a legacy interrupt controller.
>
> Hrm?
> Isn't the point of MSI to avoid any sort of interrupt controller?

No, the point of MSI is that it travels in the normal data
stream (and stays ordered with it).  In the end it *has* to
touch an interrupt controller (maybe the CPU-internal one).


Segher

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-25  8:34 ` [RFC/PATCH 14/16] MPIC MSI backend Michael Ellerman
  2007-01-26  6:43   ` Grant Grundler
@ 2007-01-26  9:11   ` Segher Boessenkool
  2007-01-27  6:33     ` Michael Ellerman
  1 sibling, 1 reply; 178+ messages in thread
From: Segher Boessenkool @ 2007-01-26  9:11 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller, Eric W. Biederman

> MPIC MSI backend. Based on code from Segher, heavily hacked by me.
> Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.

More exactly: it only deals with MSIs that are translated
by some whatever-to-HT bridge into a normal HT interrupt.

> We properly discover the HT magic address by reading the config space.

...config space of that bridge.

> Now we have an irq allocator we can support > 1 MSI, and we don't  
> reuse
> the LSI.

Right, that's what I asked about in the other thread, so
the FIXME in htmsi_alloc() can indeed go.

Why is MSI-X still unsupported?  Simply because of lack
of testing?  (See htmsi_check()).

Looks good,


Segher

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  7:48       ` Grant Grundler
@ 2007-01-26 15:26         ` Eric W. Biederman
  2007-01-26 21:58         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26 15:26 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Grant Grundler <grundler@parisc-linux.org> writes:

> On Fri, Jan 26, 2007 at 12:15:40AM -0700, Eric W. Biederman wrote:
>> > Isn't the point of MSI to avoid any sort of interrupt controller?
>> 
>> That is how the supported platforms were designed.  But something needs
>> to translate a pci message to a cpu interrupt and on PPC apparently
>> they implemented this in their interrupt controller.
>> 
>> To be fair there are also ioapic on x86 which can do this as well
>> they just haven't been sufficiently interesting to support.
>
> Sorry, I'm not understanding your point...it's past my bedtime now
> and maybe tomorrow it will make more sense.
>
> I get the impression you are confusing Local-APIC with IO-APIC.
> The former catches MSI's on behalf of the CPU and
> the latter generates the equivalent of MSI's for IRQ lines.
> Any CPU that can support MSI's has either a Local-APIC or its equivalent.
> (e.g. parisc or alpha)

Nope.  I'm talking about a feature of some of Intel's IOAPICs that we
don't use.  All it requires is a range of addresses you can aim an
MSI at.

>> The interesting case will be when there is the equivalent of an
>> iommu for msi interrupts (and quite possibly it will be the iommu)
>> that filters iommu for hardware isolation purposes.
>
> IOMMU might give us better control over what devices can write as long
> as it's always used and not bypassed (e.g. ZX1 chipset).
> Also, not all IOMMU can direct MSI transactions to a CPU.
> I thought some IOMMU implementation can only deal with cacheline
> sized transactions and I never had the impression MSI filled out
> a full cacheline. Anyway, I expect the "Virtual Machine Monitor"
> will own the IOMMU and expect that's the case in the IBM machines (RTAS
> boxen).

When you have a virtual machine monitor yes, but you might not always
have one.  The other use for that kind of hardware is isolating your
hardware devices so they can't do bad things to the rest of the system
no matter how buggy the hardware or the driver.

Anyway the point of the above was that there are things that make sense
to exist between the device and the cpu.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  8:47       ` Segher Boessenkool
@ 2007-01-26 16:32         ` Eric W. Biederman
  2007-01-26 17:19           ` Grant Grundler
  2007-01-26 22:08           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26 16:32 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

Segher Boessenkool <segher@kernel.crashing.org> writes:

>> I don't have a clue why it is called virq.  But looking at the
>> usage it must be a linux irq number as shown in /proc/interrupts and
>> as such there need be no connection with hardware.
>
> Well of course it's connected to real hardware.  The virq
> numbers are a flat space; hwirqs are not (those are relative
> to one certain interrupt controller) so virqs are easier
> in use.

Any sane architecture will allocate the irq numbers this way.  However
they are the linux abstraction of the hardware and as such a useful
mapping to the hardware is not required.  ia64 is the strong culprit
in this regard, and simply picks the next free number it can use
when a device asks for an irq.

>> I believe the ppc model is to allocate an interrupt source on their
>> existing interrupt controller and use that instead of the normal x86
>> case of having the MSI interrupt go transparently to the cpu.
>
> That's not the "PowerPC model".  On PowerPC, there is
> really only one external interrupt to the CPU.  This
> is usually connected to a "master interrupt controller",
> in this case, the MPIC on the U3/U4 system controller.

Thanks, that brings things into a little more perspective.

> This specific controller implements MSIs (just like
> HT interrupts really) by mapping HT writes to certain
> addresses to an IRQ input on the MPIC.  The only thing
> this code does is set the sense/polarity for this IRQ
> input.
>
> You really don't need to know any of this if what you
> care about is x86.  I really wonder how you can call
> the x86 case "normal" or what you mean by "transparently"
> btw ;-)


The minimum silicon version of the destination of an MSI really only
needs the ability to record that it happened.  A priori setup of the
controller (in hardware) for each individual MSI source interrupt
seems to imply extra hardware logic, and limit the total number of
MSI's the system can handle for no apparent reason.  For that
reason I expect more systems to do things closer to how x86 does it.
If for no other reason than because it is less logic to validate.

On x86 the only hardware we have to deal with is the 8 bit number
delivered to the cpu at interrupt time and the MSI registers.  All of
the rest of the x86 logic needed to translate MSI interrupts to
processor bus messages and the like has no registers we can set and
always behaves exactly the same way so is for all intents and purposes
transparent.  The PCI-HT bridge logic is the most visible our MSI
logic ever becomes.  As for the destination window, it is an
architecturally defined target with fixed meanings for all of the bits
on every system.  So by transparent I mean that we don't have to
perform any per irq setup in the hardware except the pci card to make
MSI's work.

The big difference here between what you have and what x86 has
is that on x86 I can easily set up a pool of locations usable
by MSI, allocate a location, and then independently associate
that with an MSI irq.  Apparently PPC cannot do that, although
from what little I have heard about the MPIC just now I don't
understand why not.  Any clue where I can find an MPIC datasheet?

I care about more than x86 but x86 and derivatives is the platform
I primarily work with, have test hardware for, and understand all
of the details of.  To make an abstraction that works across all
platforms and to help maintain that I need to understand all of the
relevant details so I do care about ppc.  Especially when I have ppc
people I can work with.

Likewise what is different about x86 needs some explaining so it becomes
clear why msi_ops do not handle what x86 is doing today.  The big
difference there comes with irq migration because when we migrate an
irq we must reprogram the msi registers on the cards themselves,
likewise when we mask an irq we must mask it using the msi registers.

From that comes our need for a data structure to map from an irq to an
MSI data structure in a generic fashion, because we don't just program
the pci card and forget about it.  From those requirements comes our
need for somewhat more complete support of the hardware's features
than Michael's implementation provides.
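
A minimal sketch of that requirement, not code from the patch series:
every sk_* name below is hypothetical, and the message layout is a
simplified version of the x86 one (address window at 0xfee00000,
destination APIC ID in address bits 19:12, vector in the low 8 bits of
the data word).  It shows why an irq -> MSI descriptor table is needed:
changing the target CPU means finding the card again and rewriting its
message registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NR_IRQS       256
#define SK_MSI_ADDR_BASE UINT32_C(0xfee00000)  /* x86 MSI address window */

struct sk_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint16_t data;
};

/* Per-irq bookkeeping; "hw" stands in for the card's MSI registers. */
struct sk_msi_desc {
	struct sk_msi_msg hw;
	int writes;             /* how often the card was reprogrammed */
};

/* The generic irq -> MSI descriptor map the text argues for. */
static struct sk_msi_desc *sk_irq_to_desc[SK_NR_IRQS];

/* Build a message for one destination CPU (fixed delivery, edge). */
static struct sk_msi_msg sk_compose(uint8_t dest_apic_id, uint8_t vector)
{
	struct sk_msi_msg m;

	m.address_hi = 0;
	m.address_lo = SK_MSI_ADDR_BASE | ((uint32_t)dest_apic_id << 12);
	m.data = vector;
	return m;
}

/* Stand-in for the config space writes that program the card. */
static void sk_write_msg(struct sk_msi_desc *d, const struct sk_msi_msg *m)
{
	d->hw = *m;
	d->writes++;
}

/* Migration: look up the descriptor, recompose, reprogram the card. */
static int sk_set_affinity(unsigned int irq, uint8_t dest_apic_id,
			   uint8_t vector)
{
	struct sk_msi_desc *d;
	struct sk_msi_msg m;

	assert(irq < SK_NR_IRQS);
	d = sk_irq_to_desc[irq];
	if (d == NULL)
		return -1;      /* not an MSI irq we know about */
	m = sk_compose(dest_apic_id, vector);
	sk_write_msg(d, &m);
	return 0;
}
```

Calling sk_set_affinity() twice for the same irq with different
destinations rewrites the card's registers twice, which is the step a
"program once and forget" interface has no hook for.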

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 16:32         ` Eric W. Biederman
@ 2007-01-26 17:19           ` Grant Grundler
  2007-01-26 17:56             ` Eric W. Biederman
  2007-01-26 22:40             ` Benjamin Herrenschmidt
  2007-01-26 22:08           ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-26 17:19 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

On Fri, Jan 26, 2007 at 09:32:33AM -0700, Eric W. Biederman wrote:
> > Well of course it's connected to real hardware.  The virq
> > numbers are a flat space; hwirqs are not (those are relative
> > to one certain interrupt controller) so virqs are easier
> > in use.
> 
> ....However they are the linux abstraction of the hardware and
> as such a useful mapping to the hardware is not required.

What?!!! The whole point of the abstraction ("flat space") is
to be able to do reverse lookups for additional information.

> ia64 is the strong culprit
> in this regard, and simply picks the next free number it can use
> when a device asks for an irq.

I think this is the only viable approach to support MSI migration.
Basing the "virq" value on bits in the addr/data pair can't migrate.

...
> The minimum silicon version of the destination of an MSI really only
> needs the ability to record that it happened.

"it" == record the data value sent to a specific address.

If the IRQ handler lookup is done in HW it can save us a substantial number
of CPU cycles before we invoke the corresponding handler.

> A priori setup of the
> controller (in hardware) for each individual MSI source interrupt
> seems to imply extra hardware logic, and limit the total number of
> MSI's the system can handle for no apparent reason.  For that
> reason I expect more systems to do things closer to how x86 does it.
> If for no other reason than because it is less logic to validate.

It doesn't matter how many systems "do things closer to how x86"
works since 95% (or more) of the systems running linux are x86.
Linux MSI support must work on x86.

Helping Michael make it work would be a constructive way forward.
I think Michael has the abstraction correct so it's NOT x86 centric
but still works optimally on x86.

> On x86 the only hardware we have to deal with is the 8 bit number
> delivered to the cpu at interrupt time and the MSI registers.

8 bit number? That's the Intel Interrupt architecture definition.
The PCI spec defines 16-bit messages for MSI. The chipsets
can implement any number of bits they want up to that limit.

> All of
> the rest of the x86 logic needed to translate MSI interrupts to
> processor bus messages and the like has no registers we can set

Are the EID and ID fields defined in Intel addresses not programmable?
Those are part of the MSI address.

> and
> always behaves exactly the same way so is for all intents and purposes
> transparent.  The PCI-HT bridge logic for MSI is the most visible our
> logic for MSI ever becomes.  As for the destination window it is an
> architecturally defined target with fixed meanings for all of the bits
> on every system.  So by transparent I mean that we don't have to
> perform any per irq setup in the hardware except the pci card to make
> MSI's work.

I had the impression "we" was the OS and the setup was being done by BIOS.
IIRC, main reason for doing setup in BIOS was to enable existing OS versions
to run new HW without any changes. Paying customers like that sort of thing.

thanks,
grant

> The big difference here between what you have and what x86 has
> is that on x86 I can easily setup a pool of locations usable
> by MSI allocate a location, and then independently associate
> that with an MSI irq.  Apparently PPC cannot do that, although
> from what little I have heard about the MPIC just now I don't 
> understand why not.  Any clue where I can find a MPIC datasheet?
> 
> I care about more than x86 but x86 and derivatives is the platform
> I primarily work with, have test hardware for, and understand all
> of the details of.  To make an abstraction that works across all
> platforms and to help maintain that I need to understand all of the
> relevant details so I do care about ppc.  Especially when I have ppc
> people I can work with.
> 
> Likewise what is different about x86 needs some explaining so it becomes
> clear why msi_ops do not handle what x86 is doing today.  The big
> difference there comes with irq migration because when we migrate an
> irq we must reprogram the msi registers on the cards themselves,
> likewise when we mask an irq we must mask it using the msi registers.
> 
> From that comes our need for a data structure to map from an irq to a
> msi data structure in a generic fashion, because we don't just program
> the pci card and forget about it.  From those requirements comes our
> need for a little bit more complete support of the features of the
> hardware that Michaels implementation.
> 
> Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  8:57     ` Segher Boessenkool
@ 2007-01-26 17:27       ` Grant Grundler
  0 siblings, 0 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-26 17:27 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller,
	Eric W. Biederman

On Fri, Jan 26, 2007 at 09:57:48AM +0100, Segher Boessenkool wrote:
> >>Your code appears to be nice simple clean and to not support MSI in
> >>a useful way.  I may be reading too quickly but at the moment your
> >>infrastructure appears useless if you are on a platform that doesn't
> >>enforce MSI's get filtered with a legacy interrupt controller.
> >
> >Hrm?
> >Isn't the point of MSI to avoid any sort of interrupt controller?
> 
> No, the point of MSI is that it travels in the normal data
> stream (and stays ordered with it).  In the end it *has* to
> touch an interrupt controller (maybe the CPU-internal one).

Yes, sorry. I was thinking about the properties of a "legacy interrupt
controller" and didn't intend that to mean the physical device:
    out-of-band delivery
    each IRQ line routed to one CPU on the motherboard,
    one IRQ line per device.

thanks,
grant

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 17:19           ` Grant Grundler
@ 2007-01-26 17:56             ` Eric W. Biederman
  2007-01-26 22:48               ` Benjamin Herrenschmidt
  2007-01-27  7:01               ` Michael Ellerman
  2007-01-26 22:40             ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-26 17:56 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S.Miller, Eric W. Biederman

Grant Grundler <grundler@parisc-linux.org> writes:

> On Fri, Jan 26, 2007 at 09:32:33AM -0700, Eric W. Biederman wrote:
>> > Well of course it's connected to real hardware.  The virq
>> > numbers are a flat space; hwirqs are not (those are relative
>> > to one certain interrupt controller) so virqs are easier
>> > in use.
>> 
>> ....However they are the linux abstraction of the hardware and
>> as such a useful mapping to the hardware is not required.
>
> What?!!! The whole point of the abstraction ("flat space") is
> to be able to do reverse lookups for additional information.

Yes.  But that doesn't mean the number has any useful meaning
in and of itself.  Just that you can index into a table and
get that meaning.  (i.e. If you are a human being or anything
outside of the kernel you may not be able to do a reverse lookup
because you don't have the table.)

>> ia64 is the strong culprit
>> in this regard, and simply picks the next free number it can use
>> when a device asks for an irq.
>
> I think this is the only viable approach to support MSI migration.
> Basing the "virq" value on bits in the addr/data pair can't migrate.

Thus my initial surprise at people not liking create_irq().

If the irq controller the msi arrives at can redirect the irq the
bits in the msi message could have some connection to the irq number.
Likewise if some of those bits have nothing to do with migration.

For irqs going across traces on a motherboard and into interrupt pins
you can embed a lot of that knowledge into the irq number.  For MSI
with arbitrary programmable connections the numbers have less meaning
and less need of meaning in that sense.

> ...
>> The minimum silicon version of the destination of an MSI really only
>> needs the ability to record that it happened.
>
> "it" == record the data value sent to a specific address.

The data value, the address, something.  You don't have to reply; MSIs
are edge triggered and non-acknowledged, so you just need to record
enough for the software to figure it out.

> If the IRQ handler lookup is done in HW it can save us a substantial number
> of CPU cycles before we invoke the corresponding handler.

Maybe.  I would love to see a useful implementation of that.

>> A priori setup of the
>> controller (in hardware) for each individual MSI source interrupt
>> seems to imply extra hardware logic, and limit the total number of
>> MSI's the system can handle for no apparent reason.  For that
>> reason I expect more systems to do things closer to how x86 does it.
>> If for no other reason than because it is less logic to validate.
>
> It doesn't matter how many systems "do things closer to how x86"
> works since 95% (or more) of the systems running linux are x86.
> Linux MSI support must work on x86.
>
> Helping Michael make it work would be a constructive way forward.
> I think Michael has the abstraction correct so it's NOT x86 centric
> but still works optimally on x86.

NO NO NO NO Michael's abstraction does not work on x86.
Which is a big part of my problem.
Michael's abstraction does not allow me to migrate irqs on x86.

setup_msi_msg only gets called when you enable the msi.  Nothing
gets called when irqbalance changes the cpu mask, and there is no
support that would allow that with Michael's msi ops.

I can't use Michael's msi_ops as they stand.

They also have the problem of trying to exist at two different levels
of the interrupt setup hierarchy simultaneously, which is
another part of the problem.

Michael's code is simple, beautiful, and doesn't work on x86, because
he has not implemented what needs to be there.

That is why I have asked for an evolutionary approach and not this
stupid drop and replace attempt.

Sorry for the rant, I'm just a little annoyed that you hadn't heard that
what Michael is doing does not work on x86.

>> On x86 the only hardware we have to deal with is the 8 bit number
>> delivered to the cpu at interrupt time and the MSI registers.
>
> 8 bit number? That's the Intel Interrupt architecture definition.
> The PCI spec defines 16-bit messages for MSI. The chipsets
> can implement any number of bits they want up to that limits.

I said on x86.  The cpu receives an 8 bit number.

>> All of
>> the rest of the x86 logic needed to translate MSI interrupts to
>> processor bus messages and the like has no registers we can set
>
> Are the EID and ID fields defined in Intel adrresses not programmable?
> Those are part of the MSI address.

All MSI addresses on x86 by definition are of the form 0xfee????? if I
have remembered the address correctly.  ia64 doesn't have that rule.

>> and
>> always behaves exactly the same way so is for all intents and purposes
>> transparent.  The PCI-HT bridge logic for MSI is the most visible our
>> logic for MSI ever becomes.  As for the destination window it is an
>> architecturally defined target with fixed meanings for all of the bits
>> on every system.  So by transparent I mean that we don't have to
>> perform any per irq setup in the hardware except the pci card to make
>> MSI's work.
>
> I had the impression "we" was the OS and the setup was being done by BIOS.
> IIRC, main reason for doing setup in BIOS was to enable existing OS versions
> to run new HW without any changes. Paying customers like that sort of thing.

There is an architectural definition of how irqs work on x86.  The BIOS
sets up the hardware to match that definition if there are any registers
to set up.  Things like the PCI-HT bridge registers.  There are no
registers that need to be set up on a per msi basis.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  6:43   ` Grant Grundler
  2007-01-26  7:02     ` Eric W. Biederman
@ 2007-01-26 20:41     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 20:41 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W.Biederman, shaohua.li, linux-pci, David S.Miller,
	Brice Goglin

On Thu, 2007-01-25 at 23:43 -0700, Grant Grundler wrote:
> On Thu, Jan 25, 2007 at 07:34:16PM +1100, Michael Ellerman wrote:
> > MPIC MSI backend. Based on code from Segher, heavily hacked by me.
> > Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.
> ...
> > +		/* FIXME should we save the existing type */
> > +		set_irq_type(virq, IRQ_TYPE_EDGE_RISING);
> 
> What exactly does the "virq" represent here?
> I'd like to understand if the FIXME comment could be dropped (or not).
> 
> I don't get the impression it's related to a PCI IRQ line.
> Maybe irq_create_mapping() has comments that describe hwirq and virq?
> If not, it would be useful if those terms were described.

Well, this is a ppc specific backend, so I'm not sure we need to
describe in there the way ppc interrupts work but heh ;-) Basically, on
powerpc nowadays, we disconnect "linux" irqs (virtual irqs) and
"hardware" irq numbers.

linux irqs are allocated dynamically and bound to a given PIC/hw irq
pair via irq_create_mapping() or one of the other supersets of that
function.

In the case of something like the MPIC MSI backend, we first allocate a
HW vector (we have a bitmap of free vectors), then we map it to a new
virq with irq_create_mapping(). I think the FIXME is not needed.
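
The two steps above can be sketched in plain C as follows.  This is an
illustration only, not the MPIC backend: the sketch_* names and the
table sizes are made up, and sketch_create_mapping() is a toy stand-in
for what irq_create_mapping() does (bind a hardware irq to a freshly
allocated Linux irq number).

```c
#include <assert.h>
#include <stdint.h>

#define NR_HW_VECTORS 64   /* hypothetical size of the free-vector bitmap */
#define NR_VIRQS      128  /* hypothetical size of the flat Linux irq space */
#define NO_IRQ        0    /* virq 0 means "no interrupt" in this sketch */

static uint64_t hw_vector_bitmap;     /* bit set = hardware vector in use */
static int virq_used[NR_VIRQS];
static int virq_to_hwirq[NR_VIRQS];   /* reverse lookup: virq -> hwirq */

/* Step 1: grab the lowest free hardware vector, or -1 if none is left. */
static int sketch_alloc_hwirq(void)
{
	for (int hw = 0; hw < NR_HW_VECTORS; hw++) {
		if (!(hw_vector_bitmap & (UINT64_C(1) << hw))) {
			hw_vector_bitmap |= UINT64_C(1) << hw;
			return hw;
		}
	}
	return -1;
}

static void sketch_free_hwirq(int hw)
{
	assert(hw >= 0 && hw < NR_HW_VECTORS);
	hw_vector_bitmap &= ~(UINT64_C(1) << hw);
}

/* Step 2: bind the hardware vector to the next free virq. */
static int sketch_create_mapping(int hwirq)
{
	assert(hwirq >= 0 && hwirq < NR_HW_VECTORS);
	for (int virq = 1; virq < NR_VIRQS; virq++) {
		if (!virq_used[virq]) {
			virq_used[virq] = 1;
			virq_to_hwirq[virq] = hwirq;
			return virq;
		}
	}
	return NO_IRQ;
}

/* Allocate one MSI: hardware vector first, then the virq that names it. */
static int sketch_msi_alloc(void)
{
	int hw = sketch_alloc_hwirq();
	int virq;

	if (hw < 0)
		return NO_IRQ;
	virq = sketch_create_mapping(hw);
	if (virq == NO_IRQ)
		sketch_free_hwirq(hw);  /* do not leak the vector */
	return virq;
}
```

The virq handed back carries no hardware meaning by itself; the
hardware vector is only recoverable through the virq_to_hwirq table.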

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  7:02     ` Eric W. Biederman
  2007-01-26  8:47       ` Segher Boessenkool
@ 2007-01-26 20:50       ` Benjamin Herrenschmidt
  2007-01-26 22:46       ` Paul Mackerras
  2 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 20:50 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

On Fri, 2007-01-26 at 00:02 -0700, Eric W. Biederman wrote:

> I don't have a clue why it is called virq.  But looking at the
> usage it must be a linux irq number as shown in /proc/interrupts and
> as such there need be no connection with hardware.

Indeed.

> I believe the ppc model is to allocate an interrupt source on their
> existing interrupt controller and use that instead of the normal x86
> case of having the MSI interrupt go transparently to the cpu.

Not quite, see my other reply.

> Both set_irq_type, and entries.vector take a linux irq number.
> Darn we should change that field name, it is misleading.

Yes ;-)

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:56   ` Grant Grundler
  2007-01-26  7:15     ` Eric W. Biederman
  2007-01-26  8:57     ` Segher Boessenkool
@ 2007-01-26 20:57     ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 20:57 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller, Eric W. Biederman

On Thu, 2007-01-25 at 23:56 -0700, Grant Grundler wrote:
> On Thu, Jan 25, 2007 at 11:18:20PM -0700, Eric W. Biederman wrote:
> > Your code appears to be nice simple clean and to not support MSI in
> > a useful way.  I may be reading too quickly but at the moment your
> > infrastructure appears useless if you are on a platform that doesn't
> > enforce MSI's get filtered with a legacy interrupt controller.
> 
> Hrm?
> Isn't the point of MSI to avoid any sort of interrupt controller?

Depends how it's implemented :-) In cases like powerpc where the
processor only has a single interrupt line, you need an interrupt
controller anyway.

> > You don't have MSI-X support (which is the interesting case) and you
> > don't have suspend/resume support.
> 
> I saw save/restore entry points.
> I expected suspend/resume code would use those.
> Do you agree (or not)?

MSI-X should be coming soon, I actually thought Michael had that sorted
out already, I'll check with him.

Suspend resume should just hook to the backend.

> > You don't support the MSI mask bit.
> > 
> > Looking at your msi_ops it does not map to what I am doing on x86.  There
> > is the implicit assumption that the msi_message is fixed for the lifetime
> > of the msi.  Which is wrong.

The mask bit is not necessary when hooking onto an existing PIC,
however, we should provide a set of mask/unmask for use by the backend
using the mask bit, I agree there.

> Erm...wouldn't changing the message also effectively change which handler
> ends up catching the interrupt?
> I always understood the addr/msg were a pair that HW would map to a handler.
> Can you explain what you mean by "lifetime" and "fixed"?
> What event would change the message? system Suspend/resume?

Intel, in its great wisdom, defined some bits of the message to have a
special meaning (in addition to the message address being used as a cpu
mask of destinations). I suppose there are cases where they want to
change things in there, not too sure though. I don't see anything that
can't be done by the backend though.

> ...
> > After I get some sleep I will see if I can come up with some constructive
> > criticism on how we can make things work.
> 
> Well, I hope the questions I pose above help lead the discussion in
> that direction.

Well, our basic model of alloc/enable/disable/free should map pretty
much anything, and then we provide "raw" functions for the backend to
use rather than re-implement them.

If your backend needs something not provided by those (like mask/unmask
using the mask bit(s)), then either add it to the backend itself or add
new "generic" functions for optional use by the backend.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:18 ` Eric W. Biederman
  2007-01-26  6:56   ` Grant Grundler
@ 2007-01-26 21:24   ` Benjamin Herrenschmidt
  2007-01-27  5:41   ` Michael Ellerman
  2 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 21:24 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> I haven't done more than skim the patches yet but I am distressed.
> 
> Your code appears to be nice simple clean and to not support MSI in
> a useful way.  I may be reading too quickly but at the moment your infrastructure
> appears useless if you are on a platform that doesn't enforce MSI's get filtered
> with a legacy interrupt controller.

How so ?

We have at least 3 models we had in mind when designing this:

 - the MPIC backend : MSIs are handled by a decoder that turns them into
vector inputs on the MPIC interrupt controller

 - the Cell "AXON" backend (not finished yet) : MSI messages are written
into a ring buffer in main memory (DMAed) and an IRQ to the interrupt
controller is additionally sent when such a message is written

 - the RTAS model which is a toplevel hook and thus easy

In addition, I had in mind what x86 does and I don't see how it wouldn't
fit the model.

I see it working along these lines, let me know if I missed
something:

 - alloc : allocates a vector and a linux irq, sets up the irq desc
with an irq chip mostly equivalent to what you have now, using
mask/unmask routines that toggle the mask bit (note that we should have
those in the generic code for optional use by backends) 

 - enable / disable : use the generic routines

 - setup_msi_msg : returns the appropriate address/data with all the
bits specific to the intel platform and the appropriate default
affinity.

 - free : well, should i explain ? :-)

I don't see off-hand what in that model doesn't "fit" the x86 needs.
Nowhere there is a requirement of being hooked to a separate irq
controller. One of the important ideas is that alloc() is supposed to
return a linux irq number with a fully initialized irq_desc/irq_chip.
However, that irq_chip doesn't -have- to be the one of a legacy
interrupt controller, it could be one local to the backend specific for
MSIs which uses the mask bit for mask/unmask etc... the way x86 does it.

In a way, this is similar to what we will be doing for Cell (backend not
there yet, sorry).
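
To make the alloc / setup_msi_msg / enable / free flow above concrete,
here is a rough sketch of an ops table and the generic code that would
drive it.  It is not the msi_ops structure from the patches: the sk_*
and toy_* names, the signatures, and the placeholder address are all
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_msg {
	uint64_t address;
	uint32_t data;
};

/* Hypothetical backend ops, following the four steps listed above. */
struct sk_msi_ops {
	int  (*alloc)(void *dev);   /* returns a linux irq, < 0 on error */
	int  (*setup_msi_msg)(void *dev, int irq, struct sk_msg *msg);
	void (*enable)(void *dev, const struct sk_msg *msg); /* optional */
	void (*release)(void *dev, int irq);  /* "free" in the text */
};

static int sk_generic_enable_calls;
static struct sk_msg sk_last_msg;

/* Generic enable: would write the message into the device's MSI cap. */
static void sk_generic_enable(void *dev, const struct sk_msg *msg)
{
	(void)dev;
	sk_last_msg = *msg;
	sk_generic_enable_calls++;
}

/* Generic driver-facing entry point: runs the backend hooks in order. */
static int sk_enable_msi(const struct sk_msi_ops *ops, void *dev)
{
	struct sk_msg msg;
	int irq;

	assert(ops != NULL && ops->alloc != NULL);
	assert(ops->setup_msi_msg != NULL);

	irq = ops->alloc(dev);
	if (irq < 0)
		return irq;
	if (ops->setup_msi_msg(dev, irq, &msg) != 0) {
		if (ops->release != NULL)
			ops->release(dev, irq);
		return -1;
	}
	if (ops->enable != NULL)
		ops->enable(dev, &msg);         /* backend-specific path */
	else
		sk_generic_enable(dev, &msg);   /* "use the generic routines" */
	return irq;
}

/* A toy backend, only here so the flow can be exercised. */
static int toy_alloc(void *dev)
{
	(void)dev;
	return 42;
}

static int toy_setup_msi_msg(void *dev, int irq, struct sk_msg *msg)
{
	(void)dev;
	msg->address = 0x1000;          /* placeholder, not a real window */
	msg->data = (uint32_t)irq;
	return 0;
}
```

A backend that leaves enable unset gets the generic routine; one that
needs its own irq_chip would set that up inside alloc before returning
the irq number.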

> You don't have MSI-X support (which is the interesting case) and you don't have
> suspend/resume support.

MSI-X is the main issue right now indeed. For suspend/resume, I was
thinking about just adding a pair of hooks to the ops, but it looks like
Michael didn't have time to add them just yet.

Those are the reasons why we aren't trying to -replace- the existing code
just yet :-) And why Michael's initial implementation sat in
arch/powerpc and not in a generic place, as we felt that while it fit
our immediate needs, it wasn't quite ready to take over the world yet.

(immediate needs = pci express support, there are already devices that
don't support anything but MSI, so without at least that standard MSI
support, those devices simply don't work at all... that's also why we
have some sense of "urgency" in getting that up as without it, those
devices won't work on any PCIe powerpc machine).

> You don't support the MSI mask bit.

At first I thought that could stay in the backend, but it looks like I'll
use that in the cell backend too, so we can just make a pair of generic
mask/unmask (well, two actually, one for MSI and one for MSI-X) for use
by irq_chip's in backends that use the mask bit. Should be trivial
enough.
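
To make the generic mask/unmask idea concrete, here is a minimal sketch in
plain C. Everything in it is an assumption for illustration: the struct and
function names are hypothetical, and the device state is simulated in
memory instead of being accessed through config space or the MSI-X table.
It only shows the bit manipulation involved: MSI keeps one mask bit per
vector in a mask register, while MSI-X keeps a mask bit (bit 0) in the
vector control word of each table entry.

```c
#include <stdint.h>

/* Hypothetical in-memory stand-ins for device state; a real backend
 * would read and write PCI config space (MSI) or the MSI-X table. */
struct sketch_msi_dev {
	uint32_t mask_bits;		/* MSI: bit n set => vector n masked */
};

struct sketch_msix_entry {
	uint32_t vector_ctrl;		/* MSI-X: bit 0 set => entry masked */
};

#define SKETCH_MSIX_CTRL_MASKBIT 0x1u

/* Mask (masked != 0) or unmask (masked == 0) MSI vector nr, 0..31.
 * Returns 0 on success, -1 if nr is out of range. */
int sketch_msi_set_mask(struct sketch_msi_dev *dev, unsigned int nr, int masked)
{
	if (nr >= 32)
		return -1;
	if (masked)
		dev->mask_bits |= UINT32_C(1) << nr;
	else
		dev->mask_bits &= ~(UINT32_C(1) << nr);
	return 0;
}

/* Mask or unmask a single MSI-X table entry. */
void sketch_msix_set_mask(struct sketch_msix_entry *e, int masked)
{
	if (masked)
		e->vector_ctrl |= SKETCH_MSIX_CTRL_MASKBIT;
	else
		e->vector_ctrl &= ~SKETCH_MSIX_CTRL_MASKBIT;
}
```

A backend's irq_chip mask/unmask callbacks could then be thin wrappers
around helpers of this shape, which is why keeping them generic looks
cheap.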

> Looking at your msi_ops it does not map to what I am doing on x86.  There
> is the implicit assumption that the msi_message is fixed for the lifetime
> of the msi.  Which is wrong.

That's an aspect of the x86 code I've missed... in which circumstances
do you modify the message ? I know you modify the address for affinity
setting, but the message I'm not sure what for.

For affinity, you can have your set_affinity() just call back
msi_raw_enable(), or if you want, we can export an msi_raw_update...

> So in short summary I cannot use your msi_ops they are inappropriate for
> i386, x86_64 and ia64.

I think they aren't -that- inappropriate as I mostly explained above,
but let me know if you think we missed something important.

> So at the moment I am opposed to this code because as it sits it appears to
> be a serious regression.

Only if it were to replace the intel code as of today, as it does indeed
lack some functionality that we haven't completed yet. I don't think
the overall design is a regression though, and I do think it's saner,
especially when having to deal with 3 or 4 different backends in the
same kernel as we do on powerpc (and I'm sure other archs will have
similar needs, especially in the embedded field where all sorts of crazy
things happen).

> The additional bits that feel like this code was primarily targeted at supporting
> the RTAS with real hardware support thrown in as an after thought just seem
> to add insult to injury. 

That is both unfair and untrue. It was mostly designed around some
initial MPIC backend and with the 3 cases I described above in mind
(along with whatever I understood of the x86 case back then). We then
tweaked things a bit to make the RTAS backend "fit", mostly by defining
that enable/disable/setup can be optional, in which case alloc/free are
doing all the job (and thus become top-level hooks suitable for RTAS).
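
The optional-hook arrangement described above can be sketched roughly as
follows. This is not the code from the patch series: the names
(sketch_msi_ops, sketch_msi_setup, fw_ops, hw_ops) and the address value
are made up for illustration, and the "generic" step just records the
message instead of writing it to a device. The point is only that alloc
and free are mandatory, and that a backend which leaves setup_msi_msg
unset gets treated as a top-level hook.

```c
#include <stddef.h>
#include <stdint.h>

struct sketch_msg {
	uint64_t address;
	uint32_t data;
};

/* Hypothetical ops table: alloc/free required, setup_msi_msg optional. */
struct sketch_msi_ops {
	int  (*alloc)(int nvec);	/* returns a Linux irq number, or < 0 */
	void (*free)(int irq);
	int  (*setup_msi_msg)(int irq, struct sketch_msg *msg);
};

/* Stand-in for the generic code programming the device. */
struct sketch_msg sketch_last_written;
int sketch_generic_writes;

int sketch_msi_setup(const struct sketch_msi_ops *ops, int nvec)
{
	struct sketch_msg msg = { 0, 0 };
	int irq, rc;

	if (ops == NULL || ops->alloc == NULL || ops->free == NULL)
		return -1;

	irq = ops->alloc(nvec);
	if (irq < 0)
		return irq;

	/* No setup hook: alloc() already did everything (firmware case). */
	if (ops->setup_msi_msg == NULL)
		return irq;

	rc = ops->setup_msi_msg(irq, &msg);
	if (rc != 0) {
		ops->free(irq);
		return rc;
	}
	sketch_last_written = msg;	/* generic enable would write this */
	sketch_generic_writes++;
	return irq;
}

/* Firmware-style backend: only the top-level hooks. */
int fw_alloc(int nvec) { (void)nvec; return 40; }
void fw_free(int irq) { (void)irq; }
const struct sketch_msi_ops fw_ops = { fw_alloc, fw_free, NULL };

/* Hardware-style backend: also supplies the message. */
int hw_alloc(int nvec) { (void)nvec; return 41; }
void hw_free(int irq) { (void)irq; }
int hw_setup(int irq, struct sketch_msg *msg)
{
	msg->address = 0xf8004000u;	/* arbitrary example address */
	msg->data = (uint32_t)irq;
	return 0;
}
const struct sketch_msi_ops hw_ops = { hw_alloc, hw_free, hw_setup };
```

Calling sketch_msi_setup(&fw_ops, 1) returns the irq without touching the
generic write path, while the hw_ops variant goes through it once.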

> Supporting the RTAS first and breaking everyone who actually has real
> hardware seems like very much the wrong approach to get a good
> multiple platform solution.

We have tested the MPIC (non-RTAS) model more than we have the RTAS so
this is unfair as well.
 
> After I get some sleep I will see if I can up with some constructive
> criticism on how we can make things work.

That would be great :-)

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  7:48       ` Grant Grundler
  2007-01-26 15:26         ` Eric W. Biederman
@ 2007-01-26 21:58         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 21:58 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller, Eric W. Biederman

> Hrm ok. IRQ migration shouldn't surprise anyone.
> I expect the "virq" (linux IRQ #) would hide the values changing
> in a Suspend/resume event. If the code isn't doing that for platforms
> that support suspend/resume, then I agree it's broken.

All of this is entirely backend business at this point. Michael's latest
code drop is apparently still missing suspend/resume hooks though, we
need to fix that. What those hooks do is backend specific. For example,
on MPIC machines, the vector will stay the same and it's really only a
matter of re-programming the HW with the same values. On RTAS (if we
ever implement suspend-to-something on pSeries), we would probably get
different vectors and thus have to update the virq->vector mapping.

Since virtual IRQs are something implemented by each platform
differently, none of that can be put in the generic code anyway.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 16:32         ` Eric W. Biederman
  2007-01-26 17:19           ` Grant Grundler
@ 2007-01-26 22:08           ` Benjamin Herrenschmidt
  2007-01-27  6:54             ` Michael Ellerman
  1 sibling, 1 reply; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 22:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller


> The big difference here between what you have and what x86 has
> is that on x86 I can easily setup a pool of locations usable
> by MSI allocate a location, and then independently associate
> that with an MSI irq.  Apparently PPC cannot do that, although
> from what little I have heard about the MPIC just now I don't 
> understand why not.  Any clue where I can find a MPIC datasheet?

The MPIC MSI backend is specific to a given MPIC implementation (the
Apple one in U4 which IBM also uses for some machines). It's not a
generic OpenPIC/MPIC feature.

The MPIC has a certain number of sources (the Apple one typically about
124). That's the MPIC cell in the chip. Now, those sources (wires) can
be connected internally to either physical lines or other internal
devices inside that chip, which accounts for about 7 of them. The rest
are hooked (well, I don't know the HW details, but from a software
perspective, that's how it looks) to a pair of decoders that decode HT
irq messages and MSIs, and use the number in the message to toggle that
source on the MPIC.

Thus we basically can allocate for an MSI any vector that isn't already
used by somebody else (an HT interrupt or an internal physical line).
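
As a rough illustration of "any vector that isn't already used", here is a
minimal allocator sketch. It is an assumption-laden toy, not the MPIC
backend: the names are hypothetical, the 124 comes from the approximate
source count mentioned above, and which sources get reserved is up to the
caller.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_NUM_SOURCES 124	/* approximate count quoted above */

struct sketch_vec_pool {
	uint8_t used[SKETCH_NUM_SOURCES];	/* 1 = taken, 0 = free */
};

void sketch_pool_init(struct sketch_vec_pool *p)
{
	memset(p->used, 0, sizeof(p->used));
}

/* Mark a source as taken by an HT interrupt or an internal line.
 * Returns 0 on success, -1 if out of range or already taken. */
int sketch_pool_reserve(struct sketch_vec_pool *p, unsigned int vec)
{
	if (vec >= SKETCH_NUM_SOURCES || p->used[vec])
		return -1;
	p->used[vec] = 1;
	return 0;
}

/* Hand out the lowest free source for an MSI, or -1 if none is left. */
int sketch_pool_alloc(struct sketch_vec_pool *p)
{
	unsigned int v;

	for (v = 0; v < SKETCH_NUM_SOURCES; v++) {
		if (!p->used[v]) {
			p->used[v] = 1;
			return (int)v;
		}
	}
	return -1;
}

void sketch_pool_free(struct sketch_vec_pool *p, unsigned int vec)
{
	if (vec < SKETCH_NUM_SOURCES)
		p->used[vec] = 0;
}
```

A first-fit scan is enough at this size; nothing here depends on the
vector value itself beyond it being free.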

> I care about more than x86 but x86 and derivatives is the platform
> I primarily work with, have test hardware for, and understand all
> of the details of.  To make an abstraction that works across all
> platforms and to help maintain that I need to understand all of the
> relevant details so I do care about ppc.  Especially when I have ppc
> people I can work with.

Well, MPIC is only -one- of the possible backends on powerpc. RTAS is
another. I briefly described the one we have in the Axon bridge for cell
which DMAs messages to memory and toggles an IRQ when a new message
arrives (funnily enough, that IRQ itself is routed through an MPIC :-)
But in this case, we implement it as a cascaded controller).

Other examples are Toshiba Spider MSIs which I'm not too sure how they
work off the top of my head (we might implement them, or not ...
depends) but I think they boil down to 16 lines into the Spider PIC
interrupt controller (thus similar to MPIC). Then, various embedded
processors are now showing up with PCIe support, and thus I expect MSIs,
and while we don't support MSIs on them just yet, I can already tell you
that every single one of them will do things differently :-)

However, because PowerPC has only 1 interrupt exception in its
programming model, there is always a toplevel IRQ controller, and
thus MSIs will always be routed to that one way or another.

> Likewise what is different about x86 needs some explaining so it becomes
> clear why msi_ops do not handle what x86 is doing today.  The big
> difference there comes with irq migration because when we migrate an
> irq we must reprogram the msi registers on the cards themselves,
> likewise when we mask an irq we must mask it using the msi registers.

For these, I think the best approach is to have the backend use the raw_*
functions directly, we just need to add the missing ones (raw_msi_mask,
raw_msix_mask, raw_msi_update, etc...)

> From that comes our need for a data structure to map from an irq to a
> msi data structure in a generic fashion, because we don't just program
> the pci card and forget about it.  From those requirements comes our
> need for a little bit more complete support of the features of the
> hardware that Michaels implementation.

Well, we need to go from pci_dev -> MSI and from linux irq_desc ->
MSI... for now, we can get away without the latter on powerpc, but I
understand that intel needs that. The initial intel code used an array
of NR_IRQs, but that sucks. What I remember of your patches is that they
were using chip_data in irq_desc.

The problem with using chip_data is that it will conflict with our case
where MSIs are hooked to an existing irq controller that already uses
chip_data.

Note that this is a non-issue if the usage of chip_data is kept local to
the backend. However, if we want to push that irq_desc -> msi info to
the generic code, then we'll need to find a different way, possibly
adding a pointer to irq_desc.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 17:19           ` Grant Grundler
  2007-01-26 17:56             ` Eric W. Biederman
@ 2007-01-26 22:40             ` Benjamin Herrenschmidt
  2007-01-27  2:11               ` David Miller
  1 sibling, 1 reply; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 22:40 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S.Miller, Eric W. Biederman


> What?!!! The whole point of the abstraction ("flat space") is
> to be able to do reverse lookups for additional information.

You may want to look at the virtual irq scheme we implemented for
powerpc, I think it could be useful for other architectures as well in
fact... One mistake I made was to put the documentation in the .h instead
of near the code though :-) asm-powerpc/irq.h is a good place to start.

The main reasons we did it in the first place are twofold:

 - On pSeries and to some extent with other hypervisors, IRQ numbers can
be pretty big, from encoding the geographical information about the
slot/irq to just being an opaque 64-bit "token" from the hypervisor. So
we need the ability to map that to/from linux's smaller and flatter space.

 - On a lot of machines, especially embedded (but not limited to), we
have all sorts of crazy setups of cascaded controllers on cascaded
controllers. Maintaining a flat irq model covering all cases is
basically hopeless. So our remapper is designed such that each irq
"host" (or domain) defines its own HW irq space and linux irqs can be
dynamically assigned to a host/hw_number pair.

The core provides the direct mapping linux irq (or virq) -> host/hw via
a simple array. It also provides 4 different types of reverse mapping
that the controller code can choose from for each controller:

 - Legacy: Since we decided, to avoid problems, that linux irq 0 is
always illegal and 1...15 are always "reserved" for an 8259 if one is
present in the machine, that's the option that the 8259 uses :-) It
provides a direct 1:1 mapping of 1...15 (enables them for use basically).

 - No reverse mapping: Some hypervisors are nice enough to let you
provide your virq numbers and they return them to you, so you can ask
for no reverse map at all

 - Linear reverse mapping: for use by things like mpic where a simple
table is good enough

 - Radix tree reverse mapping: for things like pSeries with a very large
HW number space.
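
For readers who don't want to dig through asm-powerpc/irq.h, here is a
much simplified sketch of the forward array plus a linear reverse map.
It is not the powerpc implementation: all names and sizes are
hypothetical, there is no locking, and only the "linear" flavour is
shown. It does keep the two properties described above: virq 0 is never
handed out and 1...15 stay reserved, and two hosts can use the same HW
number without clashing.

```c
#include <stddef.h>

#define SKETCH_NR_VIRQS		64	/* size of the flat Linux irq space */
#define SKETCH_HOST_HWIRQS	128	/* per-host HW irq space (linear map) */
#define SKETCH_NO_IRQ		0	/* virq 0 is always illegal */
#define SKETCH_FIRST_FREE	16	/* 1...15 reserved for a legacy 8259 */

struct sketch_host {
	unsigned int revmap[SKETCH_HOST_HWIRQS];	/* hwirq -> virq */
};

struct sketch_virq_entry {
	struct sketch_host *host;	/* NULL = virq not in use */
	unsigned int hwirq;
};

/* Forward map: virq -> (host, hwirq), a simple array. */
struct sketch_virq_entry sketch_virq_map[SKETCH_NR_VIRQS];

/* Map (host, hwirq) to a virq, reusing an existing mapping if any.
 * Returns SKETCH_NO_IRQ on failure. */
unsigned int sketch_create_mapping(struct sketch_host *host, unsigned int hwirq)
{
	unsigned int virq;

	if (hwirq >= SKETCH_HOST_HWIRQS)
		return SKETCH_NO_IRQ;
	if (host->revmap[hwirq] != SKETCH_NO_IRQ)
		return host->revmap[hwirq];

	for (virq = SKETCH_FIRST_FREE; virq < SKETCH_NR_VIRQS; virq++) {
		if (sketch_virq_map[virq].host == NULL) {
			sketch_virq_map[virq].host = host;
			sketch_virq_map[virq].hwirq = hwirq;
			host->revmap[hwirq] = virq;
			return virq;
		}
	}
	return SKETCH_NO_IRQ;
}

/* Reverse lookup, as an interrupt handler would do with a HW number. */
unsigned int sketch_find_mapping(const struct sketch_host *host,
				 unsigned int hwirq)
{
	if (hwirq >= SKETCH_HOST_HWIRQS)
		return SKETCH_NO_IRQ;
	return host->revmap[hwirq];
}
```

A radix tree variant would replace the revmap array with a sparse
structure but keep the same two entry points.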

> > ia64 is the strong culprit
> > in this regard, and simply picks the next free number it can use
> > when a device asks for an irq.
> 
> I think this is the only viable aproach to support MSI migration.
> Basing the "virq" value on bits in the addr/data pair can't migrate.

Yes. On PowerPC, the virq will stay the same, though we can change
everything underneath (HW number, addr/data pair, etc...).

> It doesn't matter how many systems "do things closer to how x86"
> works since 95% (or more) of the systems running linux are x86.
> Linux MSI support must work on x86.

Most certainly :-)

> Helping Michael make it work would be a constructive way forward.
> I think Michael has the abstraction correct so it's NOT x86 centric
> but still works optimally on x86.

I think so too.

> > On x86 the only hardware we have to deal with is the 8 bit number
> > delivered to the cpu at interrupt time and the MSI registers.
> 
> 8 bit number? That's the Intel Interrupt architecture definition.
> The PCI spec defines 16-bit messages for MSI. The chipsets
> can implement any number of bits they want up to that limits.

Indeed and we have MSI controllers that can deal with the full 16 bits
(the Cell Axon one for example).

> > All of
> > the rest of the x86 logic needed to translate MSI interrupts to
> > processor bus messages and the like has no registers we can set
> 
> Are the EID and ID fields defined in Intel adrresses not programmable?
> Those are part of the MSI address.

And thus the logic for doing that is platform specific and lives in the
backend with Michael's code; I don't see where the problem is there. I
agree Michael's code is missing a few things, mostly helpers for use by
the backend for masking/unmasking via config space and "updating" the
message/address, mostly things to add to the "raw" helpers. Oh, and
MSI-X of course needs to be finished.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  7:02     ` Eric W. Biederman
  2007-01-26  8:47       ` Segher Boessenkool
  2007-01-26 20:50       ` Benjamin Herrenschmidt
@ 2007-01-26 22:46       ` Paul Mackerras
  2007-01-27  2:46         ` Eric W. Biederman
  2007-01-27 18:30         ` Grant Grundler
  2 siblings, 2 replies; 178+ messages in thread
From: Paul Mackerras @ 2007-01-26 22:46 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

Eric W. Biederman writes:

> I believe the ppc model is to allocate an interrupt source on their
> existing interrupt controller and use that instead of the normal x86
> case of having the MSI interrupt go transparently to the cpu.

Do you mean that x86 cpus themselves can actually be the target of a
write on the bus?  That's the first time I've heard of the CPU itself
being a target for a bus operation.

Or do you mean there is some piece of hardware in the northbridge (or
elsewhere) that accepts the MSI message writes and asserts an
interrupt line to the CPU?  That is basically what we have on PPC.

Paul.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 17:56             ` Eric W. Biederman
@ 2007-01-26 22:48               ` Benjamin Herrenschmidt
  2007-01-27  7:01               ` Michael Ellerman
  1 sibling, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-26 22:48 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller


> > I think this is the only viable aproach to support MSI migration.
> > Basing the "virq" value on bits in the addr/data pair can't migrate.
> 
> Thus my initial surprise at people not liking create_irq().

The main reason _I_ don't like it is because I think I have something
better already :-) And thus I'm annoyed if something else starts
becoming a "generic" API. But I agree on the principle.

I'd be happy to help having the remapping core of arch/powerpc become
generic code though :-)

> If the irq controller the msi arrives at can redirect the irq the
> bits in the msi message could have some connection to the irq number.
> Likewise if some of those bits have nothing to do with migration.
> 
> For irqs going across traces on a motherboard and into interrupt pins
> you can embed a lot of that knowledge into the irq number.  For MSI
> with arbitrary programmable connections the numbers have less meaning
> and less need of meaning in that sense.

Sure, but that is totally local to the backend anyway, and thus pretty
much irrelevant to whether Michael's model fits or not as it's totally
agnostic to what your backend chooses to put in, or what HW vectors it
uses underneath. That's why I defined alloc() as returning a linux IRQ
number with a pre-initialized irq_desc/irq_chip. That's how the backend
does its arch specific salad of allocating a linux irq, possibly a
vector too, and picking up the appropriate irq_chip for MSIs.

I don't see how x86 wouldn't fit nicely in that model.

> > Helping Michael make it work would be a constructive way forward.
> > I think Michael has the abstraction correct so it's NOT x86 centric
> > but still works optimally on x86.
> 
> NO NO NO NO Michaels abstraction does not work on x86.
> Which is a big part of the my problem.
> Michaels abstraction does not allow me to migrate irqs on x86.

How so ? It's certainly missing a raw_msi_update() to allow you to
change the addr/data but apart from that, what is the problem ?

> setup_msi_msg only gets called when you enable the msi.  Nothing
> gets called when irqbalaced changes the cpu mask, and there is no
> support that would allow that with Michael's msi ops.

irq_chip->set_affinity() which, along with the rest of the irq_chip
callbacks, is set up by your backend at alloc() time, and can do what it
wants. There is absolutely no point in doing it differently as the
migration mechanism is totally implementation dependent.

As I said, there is no design problem with the ops, only a small
implementation issue in that it lacks a raw_msi_update() to let you
update the addr/data from within your set_affinity() callback.
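
A minimal sketch of what that could look like, with loud caveats: the
update helper does not exist yet, so sketch_raw_msi_update() is purely
hypothetical, and the device is simulated by a struct. The message
encoding follows the x86 layout as I understand it (address 0xfee00000
with the destination ID in bits 19:12, vector in the low byte of the
data, fixed delivery, edge triggered); treat it as illustrative only.

```c
#include <stdint.h>

struct sketch_msi_msg {
	uint32_t address_lo;
	uint32_t data;
};

/* Per-irq state a backend might hang off its irq_chip data. */
struct sketch_msi_irq {
	struct sketch_msi_msg hw;	/* what the device currently holds */
	unsigned int vector;		/* vector allocated at alloc() time */
	unsigned int updates;		/* how many times we reprogrammed */
};

/* Hypothetical raw helper: rewrite addr/data on an enabled MSI. */
void sketch_raw_msi_update(struct sketch_msi_irq *irq,
			   const struct sketch_msi_msg *msg)
{
	irq->hw = *msg;
	irq->updates++;
}

/* Backend set_affinity(): recompute the message for the new target
 * and push it through the raw helper. Returns 0, or -1 if dest_id
 * does not fit the 8-bit destination field. */
int sketch_set_affinity(struct sketch_msi_irq *irq, unsigned int dest_id)
{
	struct sketch_msi_msg msg;

	if (dest_id > 0xff)
		return -1;

	msg.address_lo = 0xfee00000u | (dest_id << 12);
	msg.data = irq->vector & 0xffu;
	sketch_raw_msi_update(irq, &msg);
	return 0;
}
```

Nothing in the ops needs to know about this; the callback is installed by
the backend and only the raw helper is shared.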

> I can't use Michaels msi_ops as they stand.

You can use the ops, you just need a few more helpers that aren't there
yet because we haven't needed them yet on powerpc.

> They also have the problem of trying to exist at two different levels
> of the interrupt hierarchy setup hierarchy simultaneously which is
> another part of the problem.

I don't understand the above.

> Micaheal's code is simple beautiful and doesn't work on x86, because
> he has not implemented what needs to be there.

We certainly haven't implemented everything that is needed for x86, that
is true, and that is why we aren't aiming at replacing x86 code just
yet, but again, I don't see what in the -model- prevents that and what
prevents x86 from fitting nicely in the model.
 
> That is why I have asked for an evolutionary approach and not this
> stupid drop and replace attempt.

Because while I think our model can fit x86 nicely, the current code
doesn't fit our needs at all and I strongly believe it's the wrong
abstraction.

> Sorry for the rant I'm just a little annoyed that you hadn't hurd that
> what Micahel is doing does not work on x86.

It does not work out of the box, but you haven't yet convinced me that
there is anything fundamental in Michael's design that prevents it from
working.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 22:40             ` Benjamin Herrenschmidt
@ 2007-01-27  2:11               ` David Miller
  0 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-27  2:11 UTC (permalink / raw)
  To: benh
  Cc: grundler, greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci,
	ebiederm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Sat, 27 Jan 2007 09:40:57 +1100

> 
> > What?!!! The whole point of the abstraction ("flat space") is
> > to be able to do reverse lookups for additional information.
> 
> You may want to look at the virtual irq scheme we implemented for
> powerpc, I think it could be useful for other architectures as well in
> fact... One mistake I did was to put the documentation in the .h instead
> of near the code though :-) asm-powerpc/irq.h is a good start to read.
> 
> The main reasons we did it in the first place are two fold:
> 
>  - On pSeries and to some extent with other hypervisors, IRQ numbers can
> be pretty big, from encoding the geographical informations about the
> slot/irq to just being an opaque 64 bits "token" from the hypervisor. So
> we need the ability to map that to/from linux smaller and flatter space.

We do the same thing on sparc64 btw, for the same reasons.  The
hypervisor on Niagara specifies IRQ numbers as opaque 64-bit
quantities.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 22:46       ` Paul Mackerras
@ 2007-01-27  2:46         ` Eric W. Biederman
  2007-01-27  3:02           ` David Miller
  2007-01-27 18:30         ` Grant Grundler
  1 sibling, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-27  2:46 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

Paul Mackerras <paulus@samba.org> writes:

> Eric W. Biederman writes:
>
>> I believe the ppc model is to allocate an interrupt source on their
>> existing interrupt controller and use that instead of the normal x86
>> case of having the MSI interrupt go transparently to the cpu.
>
> Do you mean that x86 cpus themselves can actually be the target of a
> write on the bus?  That's the first time I've heard of the CPU itself
> being a target for a bus operation.

Yes.  The cpu front side bus is packet based on all modern x86 processors,
and an irq message is one type of packet. 

> Or do you mean there is some piece of hardware in the northbridge (or
> elsewhere) that accepts the MSI message writes and asserts an
> interrupt line to the CPU?  That is basically what we have on PPC.

Nope, modern x86 cpus do not use external interrupt lines for normal
interrupts.

AMD cpus directly consume hypertransport and Intel cpus have a
proprietary but similarly capable protocol.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-27  2:46         ` Eric W. Biederman
@ 2007-01-27  3:02           ` David Miller
  2007-01-27  4:28             ` Eric W. Biederman
  0 siblings, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-27  3:02 UTC (permalink / raw)
  To: ebiederm
  Cc: grundler, greg, kyle, linuxppc-dev, paulus, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Fri, 26 Jan 2007 19:46:22 -0700

> Paul Mackerras <paulus@samba.org> writes:
> 
> > Eric W. Biederman writes:
> >
> >> I believe the ppc model is to allocate an interrupt source on their
> >> existing interrupt controller and use that instead of the normal x86
> >> case of having the MSI interrupt go transparently to the cpu.
> >
> > Do you mean that x86 cpus themselves can actually be the target of a
> > write on the bus?  That's the first time I've heard of the CPU itself
> > being a target for a bus operation.
> 
> Yes.  The cpu front side bus is packet based on all modern x86 processors,
> and an irq message is one type of packet. 

Interesting.

This is exactly how all sparc64 chips have always worked too.  On
sparc64 the cpu can actually read in the packets and process them.
Can the x86 interrupt handler get at the full packet data?

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-27  3:02           ` David Miller
@ 2007-01-27  4:28             ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-27  4:28 UTC (permalink / raw)
  To: David Miller
  Cc: grundler, greg, kyle, linuxppc-dev, paulus, brice, shaohua.li, linux-pci

David Miller <davem@davemloft.net> writes:

>
> Interesting.
>
> This is exactly how all sparc64 chips have always worked too.  On
> sparc64 the cpu can actually read in the packets and process them.
> Can the x86 interrupt handler get at the full packet data?

I wish.  All it can get is a single byte of the packet, for selecting
what to do.  The rest of the information encodes irq type and which cpu
to send the interrupt to.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
                   ` (17 preceding siblings ...)
  2007-01-26  6:18 ` Eric W. Biederman
@ 2007-01-27  4:59 ` Michael Ellerman
  2007-01-28 19:40   ` Eric W. Biederman
  19 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-27  4:59 UTC (permalink / raw)
  To: linux-pci
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, David S.Miller, Eric W.Biederman

[-- Attachment #1: Type: text/plain, Size: 648 bytes --]

On Thu, 2007-01-25 at 19:34 +1100, Michael Ellerman wrote:
> OK, here's a first cut at moving ops based MSI into the generic code. I'm
> posting this now to make sure I'm not heading off into the weeds.

Perhaps I wasn't clear enough here, by "first cut" I mean "work in
progress", "not finished yet", "preliminary", "doesn't work yet" .. etc.
It even says "RFC" in the subject line!

:)

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-26  6:18 ` Eric W. Biederman
  2007-01-26  6:56   ` Grant Grundler
  2007-01-26 21:24   ` Benjamin Herrenschmidt
@ 2007-01-27  5:41   ` Michael Ellerman
  2007-01-28  6:16     ` Eric W. Biederman
  2 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-27  5:41 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 3153 bytes --]

On Thu, 2007-01-25 at 23:18 -0700, Eric W. Biederman wrote:
> Michael Ellerman <michael@ellerman.id.au> writes:
> 
> > OK, here's a first cut at moving ops based MSI into the generic code. I'm
> > posting this now to make sure I'm not heading off into the weeds.
> 
> First thanks for copying me on this.  I really appreciate it.

No worries, thanks for looking at it.

> I haven't done more than skim the patches yet but I am distressed.
> 
> You code appears to be nice simple clean and to not support MSI in
> a useful way.  I may be reading too quickly but at the moment your infrastructure
> appears useless if you are on a platform that doesn't enforce MSI's get filtered
> with a legacy interrupt controller.

That's what PowerPC does, but I don't think there's anything in the top
level interface that requires that - it's all up to the alloc routine.

> You don't have MSI-X support (which is the interesting case) and you don't have
> suspend/resume support.

We have MSI-X support on RTAS ;), but that's cheating. I have 90% of
what I need for MSI-X, but I haven't implemented it yet, I will as part
of the port of the Intel code.

> You don't support the MSI mask bit.

IMHO that's a backend detail.

> Looking at your msi_ops it does not map to what I am doing on x86.  There
> is the implicit assumption that the msi_message is fixed for the lifetime
> of the msi.  Which is wrong.

Again, that's how PowerPC does it, but I don't think it's assumed. If
your backend needs to change the message then we can support that
reasonably easily I think.

> So in short summary I cannot use your msi_ops they are inappropriate for
> i386, x86_64 and ia64.
> 
> So at the moment I am opposed to this code because as it sits it appears to
> be a serious regression.
> 
> The additional bits that feel like this code was primarily targeted at supporting
> the RTAS with real hardware support thrown in as an after thought just seem
> to add insult to injury.  To date I have no information that indicates to me
> that the RTAS model is at all sane or makes any sense to duplicate elsewhere.
> If supporting the RTAS is what is obscuring your vision of what is really
> needed to support MSI I don't want to see RTAS support in a patch set
> until we get a good multiple platform architecture, merged into the kernel.
> 
> Supporting the RTAS first and breaking everyone who actually has real
> hardware seems like very much the wrong approach to get a good
> multiple platform solution.

I agree. We didn't design it to be a multi platform solution, we
designed it to work for us. It does that. Now we're hoping to expand it
to work for the Intel case as well.

I guess I wasn't clear enough in my original post, but I fully expect
that I'll need to tweak parts of the core to make Intel fit. That's
still a work in progress.

cheers


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26  9:11   ` Segher Boessenkool
@ 2007-01-27  6:33     ` Michael Ellerman
  0 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-27  6:33 UTC (permalink / raw)
  To: Segher Boessenkool
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller, Eric W. Biederman

[-- Attachment #1: Type: text/plain, Size: 1194 bytes --]

On Fri, 2007-01-26 at 10:11 +0100, Segher Boessenkool wrote:
> > MPIC MSI backend. Based on code from Segher, heavily hacked by me.
> > Renamed to mpic_htmsi, as it only deals with MSI over Hypertransport.
> 
> More exactly: it only deals with MSIs that are translated
> by some whatever-to-HT bridge into a normal HT interrupt.
> 
> > We properly discover the HT magic address by reading the config space.
> 
> ...config space of that bridge.
> 
> > Now we have an irq allocator we can support > 1 MSI, and we don't  
> > reuse
> > the LSI.
> 
> Right, that's what I asked about in the other thread, so
> the FIXME in htmsi_alloc() can indeed go.

No, we don't reuse the LSI irq, so the FIXME can go.

> Why is MSI-X still unsupported?  Simply because of lack
> of testing?  (See htmsi_check()).

Yeah. There are four drivers in mainline that call pci_enable_msix(), and
I don't have hardware for any of them.

cheers


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 22:08           ` Benjamin Herrenschmidt
@ 2007-01-27  6:54             ` Michael Ellerman
  0 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-27  6:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S.Miller,
	Brice Goglin

[-- Attachment #1: Type: text/plain, Size: 1303 bytes --]

On Sat, 2007-01-27 at 09:08 +1100, Benjamin Herrenschmidt wrote:
> > From that comes our need for a data structure to map from an irq to a
> > msi data structure in a generic fashion, because we don't just program
> > the pci card and forget about it.  From those requirements comes our
> > need for a little bit more complete support of the features of the
> > hardware than Michael's implementation.
> 
> Well, we need to go from pci_dev -> MSI and from linux irq_desc ->
> MSI... for now, we can get away without the latter on powerpc, but I
> understand that intel needs that. The initial intel code used an array
> of NR_IRQs, but that sucks. What I remember of your patch is that they
> were using chip_data in irq_desc.

There's still an array of NR_IRQs msi_desc*, which are also attached to
the irq_desc.

The msi_desc is allocated and immediately attached to the irq_desc. Then
later it gets hooked into the array (attach_msi_entry) - I don't
understand why both methods are necessary - they both require the irq.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 17:56             ` Eric W. Biederman
  2007-01-26 22:48               ` Benjamin Herrenschmidt
@ 2007-01-27  7:01               ` Michael Ellerman
  1 sibling, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-27  7:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller

[-- Attachment #1: Type: text/plain, Size: 1456 bytes --]

On Fri, 2007-01-26 at 10:56 -0700, Eric W. Biederman wrote:
> Grant Grundler <grundler@parisc-linux.org> writes:
> > Helping Michael make it work would be a constructive way forward.
> > I think Michael has the abstraction correct so it's NOT x86 centric
> > but still works optimally on x86.
> 
> NO NO NO NO.  Michael's abstraction does not work on x86.
> Which is a big part of my problem.
> Michael's abstraction does not allow me to migrate irqs on x86.
> 
> setup_msi_msg only gets called when you enable the msi.  Nothing
> gets called when irqbalance changes the cpu mask, and there is no
> support that would allow that with Michael's msi ops.

That's all part of the backend. You just give me an irq_desc attached to
a chip with set_affinity = set_msi_irq_affinity, exactly like the
current code.
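
A minimal userspace sketch of that idea, assuming a chip whose set_affinity
hook rewrites the destination bits of a stored MSI message. Every name here
(irq_chip_model, model_msgs, model_set_msi_irq_affinity, the MODEL_* macros)
is a hypothetical stand-in, and the bit layout is an assumption of the model,
not the kernel's real definitions:

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-ins; none of these are the kernel's real definitions. */
struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct irq_chip_model {
	/* Called when the irq is moved to another cpu. */
	void (*set_affinity)(unsigned int irq, unsigned int cpu);
};

#define MODEL_NR_IRQS    16
#define MODEL_DEST_SHIFT 12	/* assumed position of the destination id */
#define MODEL_DEST_MASK  (0xffu << MODEL_DEST_SHIFT)

/* One saved message per irq, standing in for what the device holds. */
static struct msi_msg model_msgs[MODEL_NR_IRQS];

/* Retarget an MSI by rewriting only the destination field of its message. */
static void model_set_msi_irq_affinity(unsigned int irq, unsigned int cpu)
{
	struct msi_msg *msg;

	assert(irq < MODEL_NR_IRQS);
	msg = &model_msgs[irq];
	msg->address_lo &= ~MODEL_DEST_MASK;
	msg->address_lo |= (cpu << MODEL_DEST_SHIFT) & MODEL_DEST_MASK;
}

/* The backend hands back an irq attached to a chip like this one. */
static struct irq_chip_model model_msi_chip = {
	.set_affinity = model_set_msi_irq_affinity,
};
```

The point of the sketch is only that migration lives in the chip attached to
the irq, so the generic ops do not need a hook of their own for it.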

> That is why I have asked for an evolutionary approach and not this
> stupid drop and replace attempt.

I don't intend to drop and replace, I agree that's stupid. My hope was
to have the two implementations coexist for a kernel release, giving us
time to find all the bugs on PowerPC - where we have far fewer MSI
users to piss off - and then port Intel over.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-26 22:46       ` Paul Mackerras
  2007-01-27  2:46         ` Eric W. Biederman
@ 2007-01-27 18:30         ` Grant Grundler
  2007-01-27 20:02           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 178+ messages in thread
From: Grant Grundler @ 2007-01-27 18:30 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller,
	Eric W. Biederman

On Sat, Jan 27, 2007 at 09:46:11AM +1100, Paul Mackerras wrote:
> Do you mean that x86 cpus themselves can actually be the target of a
> write on the bus?  That's the first time I've heard of the CPU itself
> being a target for a bus operation.

Though Eric gave a complete answer, I thought the Local-APIC
(onboard each CPU) is the target of the bus transaction. Intel
publishes the "Intel Interrupt Architecture" document and it describes
the API to the Local-APIC.  IA64 also uses an on-chip Local-APIC.

PA-RISC CPU (google for "PA-RISC External Interrupt Request Register")
is the target of _all_ IPI and IO interrupts (including MSI).
I think you'd find some of the comments in the PA-RISC interrupt
handling code interesting.
Look for txn_alloc_irq() in arch/parisc/kernel/irq.c.

My impression was any CPU that uses an IO-SAPIC (or -xAPIC) is
using bus transactions to communicate interrupts even if they
aren't using MSI. BIOS typically hides all the setup.

Alpha also uses bus transactions for IO interrupts. But I've read
through my ancient alpha reference manual and don't understand
exactly if the vector is part of the "DMA" transaction or is read
by the CPU off the I/O Bridge ("hose").

> Or do you mean there is some piece of hardware in the northbridge (or
> elsewhere) that accepts the MSI message writes and asserts an
> interrupt line to the CPU?  That is basically what we have on PPC.

*grin* PPC in this case looks more like "legacy x86" than x86 does today.
/me hides

hth,
grant

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 14/16] MPIC MSI backend
  2007-01-27 18:30         ` Grant Grundler
@ 2007-01-27 20:02           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-27 20:02 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Paul Mackerras,
	Brice Goglin, shaohua.li, linux-pci, David S.Miller,
	Eric W. Biederman


> My impression was any CPU that uses an IO-SAPIC (or -xAPIC) is
> using bus transactions to communicate interrupts even if they
> aren't using MSI. BIOS typically hides all the setup.
> 
> Alpha also uses bus transactions for IO interrupts. But I've read
> through my ancient alpha reference manual and don't understand
> exactly if the vector is part of the "DMA" transaction or is read
> by the CPU off the I/O Bridge ("hose").
> 
> > Or do you mean there is some piece of hardware in the northbridge (or
> > elsewhere) that accepts the MSI message writes and asserts an
> > interrupt line to the CPU?  That is basically what we have on PPC.
> 
> *grin* PPC in this case looks more like "legacy x86" than x86 does today.
> /me hides

Well, actually, Cell also has interrupts as packets on the bus :-)

(Though the way it's done on cell, you typically still need an external
interrupt controller for anything that's not on-chip unfortunately).

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-27  5:41   ` Michael Ellerman
@ 2007-01-28  6:16     ` Eric W. Biederman
  2007-01-28  8:12       ` Michael Ellerman
  0 siblings, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28  6:16 UTC (permalink / raw)
  To: michael
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> I guess I wasn't clear enough in my original post, but I fully expect
> that I'll need to tweak parts of the core to make Intel fit. That's
> still a work in progress.

Ok.  To be very clear.

Any plan that does not involve using drivers/pci/msi.c for the
raw hardware operations is flawed.  Yes that code is a mess
but it works today, and appears to capture all of the requirements.
Where there are issues that code should be fixed not ignored.

The architecture specific bits of the current msi code roughly
correspond to your alloc and free routines.  All that is
needed going from generic code to architecture specific code
is the ability to allocate and free an msi irq.  You have
a lot more operations than that and it is overkill.

As a practical measure your current operations are such a bad fit
for the architectures that a port would be very difficult.  Basically
setup_msi_message is simply a bad idea.  You need to use a
write_msi_message call from the architecture to the generic code
instead.

I have some patches cooking to clean up msi.c so it can be used
as is.  I'm pretty much there but it looks like I need to slay
msi_lock to make things sane.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28  6:16     ` Eric W. Biederman
@ 2007-01-28  8:12       ` Michael Ellerman
  2007-01-28  8:36         ` Eric W. Biederman
  0 siblings, 1 reply; 178+ messages in thread
From: Michael Ellerman @ 2007-01-28  8:12 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 2756 bytes --]

On Sat, 2007-01-27 at 23:16 -0700, Eric W. Biederman wrote:
> Michael Ellerman <michael@ellerman.id.au> writes:
> 
> > I guess I wasn't clear enough in my original post, but I fully expect
> > that I'll need to tweak parts of the core to make Intel fit. That's
> > still a work in progress.
> 
> Ok.  To be very clear.
> 
> Any plan that does not involve using drivers/pci/msi.c for the
> raw hardware operations is flawed.  Yes that code is a mess
> but it works today, and appears to capture all of the requirements.
> Where there are issues that code should be fixed not ignored.

Which is what I plan to do. I already have a patch which turns the
current code into a backend for my code, it's ugly as hell, it maintains
msi_info and the msi_descs which is stupid, but it seems to work.

We should probably just stop talking until I've got that series worked
out and posted, and then you can tell me what you think of it :)

> The architecture specific bits of the current msi code roughly
> correspond to your alloc and free routines.  All that is
> needed going from generic code to architecture specific code
> is the ability to allocate and free an msi irq.  You have
> a lot more operations than that and it is overkill.

Except you keep ignoring the hypervisor case, which we have to support.
I realise you'd rather not think about it, sure it's ugly, but that's
our reality. We could isolate all of that in arch/powerpc, but Greg has
said he doesn't want two implementations, and I think in the long term
that's the right approach - we should be able to come up with a common
implementation.

> As a practical measure your current operations are such a bad fit
> for the architectures that a port would be very difficult.  Basically
> setup_msi_message is simply a bad idea.  You need to use a
> write_msi_message call from the architecture to the generic code
> instead.

i386's msi_compose_msg() would just become setup_msi_message(), the
setup of the irq chip etc. would go in alloc. For irq affinity, for now
we'll just keep exporting read/write_msi_msg(). But I don't see what the
fundamental problem is.

> I have some patches cooking to clean up msi.c so it can be used
> as is.  I'm pretty much there but it looks like I need to slay
> msi_lock to make things sane.

If you can post them soon that would be good. I'm already heavily
hacking the intel code to work as a backend for me so anything you do
will conflict with that work.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-25  8:34 ` [RFC/PATCH 4/16] Abstract MSI suspend Michael Ellerman
  2007-01-25 22:33   ` patch msi-abstract-msi-suspend.patch added to gregkh-2.6 tree gregkh
@ 2007-01-28  8:27   ` Eric W. Biederman
  2007-01-29  7:22     ` Michael Ellerman
  1 sibling, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28  8:27 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> Currently pci_disable_device() disables MSI on a device by twiddling
> bits in config space via disable_msi_mode().
>
> On some platforms that may not be appropriate, so abstract the MSI
> suspend logic into pci_disable_device_msi().

>
> Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> ---
>
>  drivers/pci/msi.c |   11 +++++++++++
>  drivers/pci/pci.c |    7 +------
>  drivers/pci/pci.h |    2 ++
>  3 files changed, 14 insertions(+), 6 deletions(-)
>
> Index: msi/drivers/pci/msi.c
> ===================================================================
> --- msi.orig/drivers/pci/msi.c
> +++ msi/drivers/pci/msi.c
> @@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
>  	pci_intx(dev, 1);  /* enable intx */
>  }
>  
> +void pci_disable_device_msi(struct pci_dev *dev)
> +{
> +	if (dev->msi_enabled)
> +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> +			PCI_CAP_ID_MSI);
> +
> +	if (dev->msix_enabled)
> +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> +			PCI_CAP_ID_MSIX);

Just a quick note. This is wrong.  The pci_find_capability() call should
look up PCI_CAP_ID_MSIX, not PCI_CAP_ID_MSI.
The code that is being moved is buggy.  So the patch itself doesn't
make the situation any worse.
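
For illustration, a userspace sketch of the function with the MSI-X branch
looking up the MSI-X capability. The pci_dev_model structure and the model_*
helpers are hypothetical stand-ins that only record what they were called
with; this is not the patch as merged:

```c
#include <assert.h>
#include <stddef.h>

#define PCI_CAP_ID_MSI  0x05
#define PCI_CAP_ID_MSIX 0x11

/* Hypothetical device model: capability offsets plus a call record. */
struct pci_dev_model {
	int msi_enabled;
	int msix_enabled;
	int msi_pos;	/* config space offset of the MSI capability */
	int msix_pos;	/* config space offset of the MSI-X capability */
	int last_pos;	/* offset passed to the last disable call */
	int last_type;	/* capability type passed to the last disable call */
};

/* Stand-in for pci_find_capability(): returns the stored offset. */
static int model_find_capability(const struct pci_dev_model *dev, int cap)
{
	if (cap == PCI_CAP_ID_MSI)
		return dev->msi_pos;
	if (cap == PCI_CAP_ID_MSIX)
		return dev->msix_pos;
	return 0;
}

/* Stand-in for disable_msi_mode(): only records its arguments. */
static void model_disable_msi_mode(struct pci_dev_model *dev, int pos, int type)
{
	dev->last_pos = pos;
	dev->last_type = type;
}

static void model_pci_disable_device_msi(struct pci_dev_model *dev)
{
	assert(dev != NULL);

	if (dev->msi_enabled)
		model_disable_msi_mode(dev,
			model_find_capability(dev, PCI_CAP_ID_MSI),
			PCI_CAP_ID_MSI);

	/* Look up the MSI-X capability here, not the MSI one. */
	if (dev->msix_enabled)
		model_disable_msi_mode(dev,
			model_find_capability(dev, PCI_CAP_ID_MSIX),
			PCI_CAP_ID_MSIX);
}
```

With the original lookup, the MSI-X branch would have been handed the offset
of the MSI capability while being told to treat it as MSI-X.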

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28  8:12       ` Michael Ellerman
@ 2007-01-28  8:36         ` Eric W. Biederman
  2007-01-28 20:14           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28  8:36 UTC (permalink / raw)
  To: michael
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> We should probably just stop talking until I've got that series worked
> out and posted, and then you can tell me what you think of it :)

Sounds like a plan.  I have a series that kills the worst of the current
code as far as multiple architecture stuff goes, which should make your
series prettier.  I just finished testing and I'm heading for bed now.

When I'm alert enough to rebase my changes onto Greg's tree
instead of linus-current, I'll submit it.

I don't think there are going to be any conflicts with your first
4 patches that Greg sucked up but I figure it is best to check.

Anyway for architecture hooks I have it down to just:
/*
 * The arch hook for setting up msi irqs
 */
int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
void arch_teardown_msi_irq(unsigned int irq);

Which should be good enough to handle everything but RTAS.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [PATCH 0/6] MSI portability cleanups
  2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
@ 2007-01-28 19:40   ` Eric W. Biederman
  2007-01-25  8:34 ` [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state Michael Ellerman
                     ` (18 subsequent siblings)
  19 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


This patchset is against gregkh-pci but except for the context around
msi_lookup_irq being completely different it applies cleanly to 2.6.20-rc6
as well.

When I first looked at this problem I thought: no big deal, it will be one
or two simple patches and that is it.

When I looked more closely I discovered that to be certain of not introducing
bugs I would have to kill msi_lock, which made the problem a little more
difficult.

The result of this patchset is that architecture hooks become:

int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
void arch_teardown_msi_irq(unsigned int irq);

and are responsible for allocating and freeing the irq as well
as setting it up.
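
A rough userspace sketch of what a backend behind these two hooks could look
like, assuming the setup hook returns the allocated irq number or a negative
errno (the text above does not spell out the return convention). The struct
bodies, the MODEL_* constants and the model_irq_to_desc table are hypothetical
and stand in for the real kernel types:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct pci_dev { int unused; };
struct msi_desc {
	struct pci_dev *dev;
	unsigned int irq;
};

#define MODEL_FIRST_MSI_IRQ 32	/* assumed start of the MSI irq range */
#define MODEL_NR_MSI_IRQS   4	/* assumed size of the MSI irq range */

/* Slot i is in use when it points at the descriptor that owns it. */
static struct msi_desc *model_irq_to_desc[MODEL_NR_MSI_IRQS];

/* Allocate an irq for this descriptor and bind the two together. */
int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
{
	unsigned int i;

	assert(dev != NULL && desc != NULL);
	for (i = 0; i < MODEL_NR_MSI_IRQS; i++) {
		if (!model_irq_to_desc[i]) {
			model_irq_to_desc[i] = desc;
			desc->dev = dev;
			desc->irq = MODEL_FIRST_MSI_IRQ + i;
			return (int)desc->irq;
		}
	}
	return -ENOSPC;	/* no free irq left in the model's range */
}

/* Release the irq so a later setup call can hand it out again. */
void arch_teardown_msi_irq(unsigned int irq)
{
	if (irq < MODEL_FIRST_MSI_IRQ ||
	    irq >= MODEL_FIRST_MSI_IRQ + MODEL_NR_MSI_IRQS)
		return;
	model_irq_to_desc[irq - MODEL_FIRST_MSI_IRQ] = NULL;
}
```

A real backend would also program the interrupt controller and compose the
MSI message in the setup hook; the sketch keeps only the allocate and free
halves of the contract.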

This touches the architecture code for i386, x86_64, and ia64 to
accomplish this.

Since I couldn't test ia64 I reviewed the code closely, and compile
tested it.

The other big change is that I added a field to irq_desc to point
at the msi_desc.  This removes the conflicts with the existing pointer
fields and makes the irq -> msi_desc mapping useable outside of msi.c

The only architecture problem that isn't solvable in this context is
the problem of supporting the crazy hypervisor on the ppc RTAS, which
asks us to drive the hardware but does not give us access to the
hardware registers.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 19:40   ` Eric W. Biederman
@ 2007-01-28 19:42     ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The function msi_lookup_irq was horrible.  As a side effect of running
it changed dev->irq, and then the callers would need to change it
back.  In addition it does a global scan through all of the irqs,
which seems to be the sole justification of the msi_lock.

To remove the need for msi_lookup_irq I added first_msi_irq to struct
pci_dev.  Then depending on the context I replaced msi_lookup_irq with
dev->first_msi_irq, dev->msi_enabled, or dev->msix_enabled.

msi_enabled and msix_enabled were already present in pci_dev for other
reasons.
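
The difference can be sketched in userspace as follows. The pci_dev_model and
msi_desc_model structures, the model_msi_desc table and both helper functions
are hypothetical simplifications, not the code in msi.c:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_IRQS 64

/* Hypothetical, trimmed-down device and descriptor. */
struct pci_dev_model {
	unsigned int irq;		/* pin-assertion irq */
	unsigned int first_msi_irq;	/* cached at enable time */
	int msi_enabled;
};

struct msi_desc_model {
	struct pci_dev_model *dev;
};

static struct msi_desc_model *model_msi_desc[MODEL_NR_IRQS];

/*
 * Old style: walk every irq to find the device's entry and, as a side
 * effect, overwrite dev->irq, which the caller then has to restore.
 */
static int model_lookup_by_scan(struct pci_dev_model *dev)
{
	unsigned int irq;

	for (irq = 0; irq < MODEL_NR_IRQS; irq++) {
		if (model_msi_desc[irq] && model_msi_desc[irq]->dev == dev) {
			dev->irq = irq;
			return 0;
		}
	}
	return -1;
}

/* New style: remember the irq on the device when MSI is enabled. */
static void model_attach(struct pci_dev_model *dev,
			 struct msi_desc_model *desc, unsigned int irq)
{
	assert(irq < MODEL_NR_IRQS);
	desc->dev = dev;
	model_msi_desc[irq] = desc;
	dev->first_msi_irq = irq;
	dev->msi_enabled = 1;
}
```

After model_attach(), callers can test dev->msi_enabled and read
dev->first_msi_irq directly, without a scan and without touching dev->irq.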

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c   |  149 ++++++++++++++++++++-------------------------------
 include/linux/pci.h |    3 +
 2 files changed, 62 insertions(+), 90 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index bca5a8a..71080c9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -283,28 +283,6 @@ void pci_disable_device_msi(struct pci_dev *dev)
 			PCI_CAP_ID_MSIX);
 }
 
-static int msi_lookup_irq(struct pci_dev *dev, int type)
-{
-	int irq;
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
-	for (irq = 0; irq < NR_IRQS; irq++) {
-		if (!msi_desc[irq] || msi_desc[irq]->dev != dev ||
-			msi_desc[irq]->msi_attrib.type != type ||
-			msi_desc[irq]->msi_attrib.default_irq != dev->irq)
-			continue;
-		spin_unlock_irqrestore(&msi_lock, flags);
-		/* This pre-assigned MSI irq for this device
-		   already exists. Override dev->irq with this irq */
-		dev->irq = irq;
-		return 0;
-	}
-	spin_unlock_irqrestore(&msi_lock, flags);
-
-	return -EACCES;
-}
-
 #ifdef CONFIG_PM
 static int __pci_save_msi_state(struct pci_dev *dev)
 {
@@ -375,11 +353,13 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
 static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
-	int temp;
 	int irq, head, tail = 0;
 	u16 control;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return 0;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (pos <= 0 || dev->no_msi)
 		return 0;
@@ -397,13 +377,7 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	*((u16 *)&save_state->data[0]) = control;
 
 	/* save the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		kfree(save_state);
-		return -EINVAL;
-	}
-
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		struct msi_desc *entry;
 
@@ -413,7 +387,6 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	save_state->cap_nr = PCI_CAP_ID_MSIX;
 	pci_add_saved_cap(dev, save_state);
@@ -439,9 +412,11 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	int pos;
 	int irq, head, tail = 0;
 	struct msi_desc *entry;
-	int temp;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return;
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_MSIX);
 	if (!save_state)
 		return;
@@ -454,10 +429,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		return;
 
 	/* route the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX))
-		return;
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		entry = msi_desc[irq];
 		write_msi_msg(irq, &entry->msg_save);
@@ -465,7 +437,6 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
@@ -535,6 +506,7 @@ static int msi_capability_init(struct pci_dev *dev)
 		return status;
 	}
 
+	dev->first_msi_irq = irq;
 	attach_msi_entry(entry, irq);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
@@ -631,6 +603,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			avail = -EBUSY;
 		return avail;
 	}
+	dev->first_msi_irq = entries[0].vector;
 	/* Set MSI-X enabled bits */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
@@ -678,13 +651,11 @@ int pci_msi_supported(struct pci_dev * dev)
  **/
 int pci_enable_msi(struct pci_dev* dev)
 {
-	int pos, temp, status;
+	int pos, status;
 
 	if (pci_msi_supported(dev) < 0)
 		return -EINVAL;
 
-	temp = dev->irq;
-
 	status = msi_init();
 	if (status < 0)
 		return status;
@@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
 	if (!pos)
 		return -EINVAL;
 
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
+	WARN_ON(!!dev->msi_enabled);
 
 	/* Check whether driver already requested for MSI-X irqs */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 			printk(KERN_INFO "PCI: %s: Can't enable MSI.  "
-			       "Device already has MSI-X irq assigned\n",
+			       "Device already has MSI-X enabled\n",
 			       pci_name(dev));
-			dev->irq = temp;
 			return -EINVAL;
 	}
 	status = msi_capability_init(dev);
@@ -720,6 +690,9 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msi_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (!pos)
 		return;
@@ -728,28 +701,30 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!(control & PCI_MSI_FLAGS_ENABLE))
 		return;
 
+
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
 	spin_lock_irqsave(&msi_lock, flags);
-	entry = msi_desc[dev->irq];
+	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
-	if (irq_has_action(dev->irq)) {
+	if (irq_has_action(dev->first_msi_irq)) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
-		       pci_name(dev), dev->irq);
-		BUG_ON(irq_has_action(dev->irq));
+		       pci_name(dev), dev->first_msi_irq);
+		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
 		spin_unlock_irqrestore(&msi_lock, flags);
-		msi_free_irq(dev, dev->irq);
+		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
 		dev->irq = default_irq;
 	}
+	dev->first_msi_irq = 0;
 }
 
 static int msi_free_irq(struct pci_dev* dev, int irq)
@@ -808,7 +783,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 {
 	int status, pos, nr_entries;
-	int i, j, temp;
+	int i, j;
 	u16 control;
 
 	if (!entries || pci_msi_supported(dev) < 0)
@@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 				return -EINVAL;	/* duplicate entry */
 		}
 	}
-	temp = dev->irq;
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
+	WARN_ON(!!dev->msix_enabled);
 
 	/* Check whether driver already requested for MSI irq */
    	if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
-		!msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
+		dev->msi_enabled) {
 		printk(KERN_INFO "PCI: %s: Can't enable MSI-X.  "
 		       "Device already has an MSI irq assigned\n",
 		       pci_name(dev));
-		dev->irq = temp;
 		return -EINVAL;
 	}
 	status = msix_capability_init(dev, entries, nvec);
@@ -854,7 +827,9 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 
 void pci_disable_msix(struct pci_dev* dev)
 {
-	int pos, temp;
+	int irq, head, tail = 0, warning = 0;
+	unsigned long flags;
+	int pos;
 	u16 control;
 
 	if (!pci_msi_enable)
@@ -862,6 +837,9 @@ void pci_disable_msix(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msix_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (!pos)
 		return;
@@ -872,31 +850,25 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
-	temp = dev->irq;
-	if (!msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		int irq, head, tail = 0, warning = 0;
-		unsigned long flags;
-
-		irq = head = dev->irq;
-		dev->irq = temp;			/* Restore pin IRQ */
-		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
-			tail = msi_desc[irq]->link.tail;
-			spin_unlock_irqrestore(&msi_lock, flags);
-			if (irq_has_action(irq))
-				warning = 1;
-			else if (irq != head)	/* Release MSI-X irq */
-				msi_free_irq(dev, irq);
-			irq = tail;
-		}
-		msi_free_irq(dev, irq);
-		if (warning) {
-			printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
-			       "free_irq() on all MSI-X irqs\n",
-			       pci_name(dev));
-			BUG_ON(warning > 0);
-		}
+	irq = head = dev->first_msi_irq;
+	while (head != tail) {
+		spin_lock_irqsave(&msi_lock, flags);
+		tail = msi_desc[irq]->link.tail;
+		spin_unlock_irqrestore(&msi_lock, flags);
+		if (irq_has_action(irq))
+			warning = 1;
+		else if (irq != head)	/* Release MSI-X irq */
+			msi_free_irq(dev, irq);
+		irq = tail;
+	}
+	msi_free_irq(dev, irq);
+	if (warning) {
+		printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
+			"free_irq() on all MSI-X irqs\n",
+			pci_name(dev));
+		BUG_ON(warning > 0);
 	}
+	dev->first_msi_irq = 0;
 }
 
 /**
@@ -910,30 +882,28 @@ void pci_disable_msix(struct pci_dev* dev)
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos, temp;
+	int pos;
 	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
 
-	temp = dev->irq;		/* Save IOAPIC IRQ */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
-		if (irq_has_action(dev->irq)) {
+	if (pos > 0 && dev->msi_enabled) {
+		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
-			       pci_name(dev), dev->irq);
-			BUG_ON(irq_has_action(dev->irq));
+			       pci_name(dev), dev->first_msi_irq);
+			BUG_ON(irq_has_action(dev->first_msi_irq));
 		} else /* Release MSI irq assigned to this device */
-			msi_free_irq(dev, dev->irq);
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
+			msi_free_irq(dev, dev->first_msi_irq);
 	}
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-		irq = head = dev->irq;
+		irq = head = dev->first_msi_irq;
 		while (head != tail) {
 			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
@@ -953,7 +923,6 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 			       pci_name(dev));
 			BUG_ON(warning > 0);
 		}
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
 	}
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 29765e2..1507f8c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -174,6 +174,9 @@ struct pci_dev {
 	struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
 	int rom_attr_enabled;		/* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+#ifdef CONFIG_PCI_MSI
+	unsigned int first_msi_irq;
+#endif
 };
 
 #define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 178+ messages in thread

* [PATCH 1/6] msi: Kill msi_lookup_irq
@ 2007-01-28 19:42     ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Tony Luck, Grant Grundler, Ingo Molnar, linux-kernel,
	Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller


The function msi_lookup_irq was horrible.  As a side effect of running
it changed dev->irq, and then the callers would need to change it
back.  In addition it does a global scan through all of the irqs,
which seems to be the sole justification of the msi_lock.

To remove the need for msi_lookup_irq I added first_msi_irq to struct
pci_dev.  Then depending on the context I replaced msi_lookup_irq with
dev->first_msi_irq, dev->msi_enabled, or dev->msix_enabled.

msi_enabled and msix_enabled were already present in pci_dev for other
reasons.
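
For readers following along, here is a rough user-space sketch of the
difference, not the kernel code itself.  Everything prefixed toy_ is a
hypothetical stand-in invented for illustration; only the field names
irq, first_msi_irq and msi_enabled mirror the patch.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_IRQS 16	/* toy value; the real NR_IRQS is arch-specific */

/* Hypothetical, heavily simplified stand-in for struct pci_dev. */
struct toy_dev {
	unsigned int irq;		/* pin-assertion irq */
	unsigned int first_msi_irq;	/* cached by the enable path */
	unsigned int msi_enabled;
};

/* Hypothetical stand-in for struct msi_desc. */
struct toy_desc {
	struct toy_dev *dev;
	unsigned int default_irq;
};

static struct toy_desc *toy_table[TOY_NR_IRQS];

/* Old style: global scan that overwrites dev->irq on a match. */
static int toy_lookup_irq(struct toy_dev *dev)
{
	unsigned int irq;

	for (irq = 0; irq < TOY_NR_IRQS; irq++) {
		if (!toy_table[irq] || toy_table[irq]->dev != dev ||
		    toy_table[irq]->default_irq != dev->irq)
			continue;
		dev->irq = irq;	/* side effect the caller has to undo */
		return 0;
	}
	return -1;
}

/* New style: no scan and no side effect, just read the cached state. */
static unsigned int toy_first_msi_irq(const struct toy_dev *dev)
{
	assert(dev != NULL);
	return dev->msi_enabled ? dev->first_msi_irq : 0;
}
```

The point of the sketch is only that the second helper takes a const
pointer: nothing needs to be saved in a temp and restored afterwards.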

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c   |  149 ++++++++++++++++++++-------------------------------
 include/linux/pci.h |    3 +
 2 files changed, 62 insertions(+), 90 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index bca5a8a..71080c9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -283,28 +283,6 @@ void pci_disable_device_msi(struct pci_dev *dev)
 			PCI_CAP_ID_MSIX);
 }
 
-static int msi_lookup_irq(struct pci_dev *dev, int type)
-{
-	int irq;
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
-	for (irq = 0; irq < NR_IRQS; irq++) {
-		if (!msi_desc[irq] || msi_desc[irq]->dev != dev ||
-			msi_desc[irq]->msi_attrib.type != type ||
-			msi_desc[irq]->msi_attrib.default_irq != dev->irq)
-			continue;
-		spin_unlock_irqrestore(&msi_lock, flags);
-		/* This pre-assigned MSI irq for this device
-		   already exists. Override dev->irq with this irq */
-		dev->irq = irq;
-		return 0;
-	}
-	spin_unlock_irqrestore(&msi_lock, flags);
-
-	return -EACCES;
-}
-
 #ifdef CONFIG_PM
 static int __pci_save_msi_state(struct pci_dev *dev)
 {
@@ -375,11 +353,13 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
 static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
-	int temp;
 	int irq, head, tail = 0;
 	u16 control;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return 0;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (pos <= 0 || dev->no_msi)
 		return 0;
@@ -397,13 +377,7 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	*((u16 *)&save_state->data[0]) = control;
 
 	/* save the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		kfree(save_state);
-		return -EINVAL;
-	}
-
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		struct msi_desc *entry;
 
@@ -413,7 +387,6 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	save_state->cap_nr = PCI_CAP_ID_MSIX;
 	pci_add_saved_cap(dev, save_state);
@@ -439,9 +412,11 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	int pos;
 	int irq, head, tail = 0;
 	struct msi_desc *entry;
-	int temp;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return;
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_MSIX);
 	if (!save_state)
 		return;
@@ -454,10 +429,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		return;
 
 	/* route the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX))
-		return;
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		entry = msi_desc[irq];
 		write_msi_msg(irq, &entry->msg_save);
@@ -465,7 +437,6 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
@@ -535,6 +506,7 @@ static int msi_capability_init(struct pci_dev *dev)
 		return status;
 	}
 
+	dev->first_msi_irq = irq;
 	attach_msi_entry(entry, irq);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
@@ -631,6 +603,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			avail = -EBUSY;
 		return avail;
 	}
+	dev->first_msi_irq = entries[0].vector;
 	/* Set MSI-X enabled bits */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
@@ -678,13 +651,11 @@ int pci_msi_supported(struct pci_dev * dev)
  **/
 int pci_enable_msi(struct pci_dev* dev)
 {
-	int pos, temp, status;
+	int pos, status;
 
 	if (pci_msi_supported(dev) < 0)
 		return -EINVAL;
 
-	temp = dev->irq;
-
 	status = msi_init();
 	if (status < 0)
 		return status;
@@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
 	if (!pos)
 		return -EINVAL;
 
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
+	WARN_ON(!!dev->msi_enabled);
 
 	/* Check whether driver already requested for MSI-X irqs */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 			printk(KERN_INFO "PCI: %s: Can't enable MSI.  "
-			       "Device already has MSI-X irq assigned\n",
+			       "Device already has MSI-X enabled\n",
 			       pci_name(dev));
-			dev->irq = temp;
 			return -EINVAL;
 	}
 	status = msi_capability_init(dev);
@@ -720,6 +690,9 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msi_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (!pos)
 		return;
@@ -728,28 +701,30 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!(control & PCI_MSI_FLAGS_ENABLE))
 		return;
 
+
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
 	spin_lock_irqsave(&msi_lock, flags);
-	entry = msi_desc[dev->irq];
+	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
-	if (irq_has_action(dev->irq)) {
+	if (irq_has_action(dev->first_msi_irq)) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
-		       pci_name(dev), dev->irq);
-		BUG_ON(irq_has_action(dev->irq));
+		       pci_name(dev), dev->first_msi_irq);
+		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
 		spin_unlock_irqrestore(&msi_lock, flags);
-		msi_free_irq(dev, dev->irq);
+		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
 		dev->irq = default_irq;
 	}
+	dev->first_msi_irq = 0;
 }
 
 static int msi_free_irq(struct pci_dev* dev, int irq)
@@ -808,7 +783,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 {
 	int status, pos, nr_entries;
-	int i, j, temp;
+	int i, j;
 	u16 control;
 
 	if (!entries || pci_msi_supported(dev) < 0)
@@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 				return -EINVAL;	/* duplicate entry */
 		}
 	}
-	temp = dev->irq;
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
+	WARN_ON(!!dev->msix_enabled);
 
 	/* Check whether driver already requested for MSI irq */
    	if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
-		!msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
+		dev->msi_enabled) {
 		printk(KERN_INFO "PCI: %s: Can't enable MSI-X.  "
 		       "Device already has an MSI irq assigned\n",
 		       pci_name(dev));
-		dev->irq = temp;
 		return -EINVAL;
 	}
 	status = msix_capability_init(dev, entries, nvec);
@@ -854,7 +827,9 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 
 void pci_disable_msix(struct pci_dev* dev)
 {
-	int pos, temp;
+	int irq, head, tail = 0, warning = 0;
+	unsigned long flags;
+	int pos;
 	u16 control;
 
 	if (!pci_msi_enable)
@@ -862,6 +837,9 @@ void pci_disable_msix(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msix_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (!pos)
 		return;
@@ -872,31 +850,25 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
-	temp = dev->irq;
-	if (!msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		int irq, head, tail = 0, warning = 0;
-		unsigned long flags;
-
-		irq = head = dev->irq;
-		dev->irq = temp;			/* Restore pin IRQ */
-		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
-			tail = msi_desc[irq]->link.tail;
-			spin_unlock_irqrestore(&msi_lock, flags);
-			if (irq_has_action(irq))
-				warning = 1;
-			else if (irq != head)	/* Release MSI-X irq */
-				msi_free_irq(dev, irq);
-			irq = tail;
-		}
-		msi_free_irq(dev, irq);
-		if (warning) {
-			printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
-			       "free_irq() on all MSI-X irqs\n",
-			       pci_name(dev));
-			BUG_ON(warning > 0);
-		}
+	irq = head = dev->first_msi_irq;
+	while (head != tail) {
+		spin_lock_irqsave(&msi_lock, flags);
+		tail = msi_desc[irq]->link.tail;
+		spin_unlock_irqrestore(&msi_lock, flags);
+		if (irq_has_action(irq))
+			warning = 1;
+		else if (irq != head)	/* Release MSI-X irq */
+			msi_free_irq(dev, irq);
+		irq = tail;
+	}
+	msi_free_irq(dev, irq);
+	if (warning) {
+		printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
+			"free_irq() on all MSI-X irqs\n",
+			pci_name(dev));
+		BUG_ON(warning > 0);
 	}
+	dev->first_msi_irq = 0;
 }
 
 /**
@@ -910,30 +882,28 @@ void pci_disable_msix(struct pci_dev* dev)
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos, temp;
+	int pos;
 	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
 
-	temp = dev->irq;		/* Save IOAPIC IRQ */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
-		if (irq_has_action(dev->irq)) {
+	if (pos > 0 && dev->msi_enabled) {
+		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
-			       pci_name(dev), dev->irq);
-			BUG_ON(irq_has_action(dev->irq));
+			       pci_name(dev), dev->first_msi_irq);
+			BUG_ON(irq_has_action(dev->first_msi_irq));
 		} else /* Release MSI irq assigned to this device */
-			msi_free_irq(dev, dev->irq);
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
+			msi_free_irq(dev, dev->first_msi_irq);
 	}
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-		irq = head = dev->irq;
+		irq = head = dev->first_msi_irq;
 		while (head != tail) {
 			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
@@ -953,7 +923,6 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 			       pci_name(dev));
 			BUG_ON(warning > 0);
 		}
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
 	}
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 29765e2..1507f8c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -174,6 +174,9 @@ struct pci_dev {
 	struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
 	int rom_attr_enabled;		/* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+#ifdef CONFIG_PCI_MSI
+	unsigned int first_msi_irq;
+#endif
 };
 
 #define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)
-- 
1.4.4.1.g278f

^ permalink raw reply related	[flat|nested] 178+ messages in thread

* [PATCH 2/6] msi: Remove msi_lock.
  2007-01-28 19:42     ` Eric W. Biederman
@ 2007-01-28 19:44       ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


With the removal of msi_lookup_irq, all of the functions using msi_lock
operate on a single device, and none of them can reasonably be
called on that device at the same time.

Since what little synchronization is needed has to happen outside of
the msi functions, msi_lock can never be contended; as such it is
useless and just complicates the code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |   20 --------------------
 1 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 71080c9..5e7a187 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static DEFINE_SPINLOCK(msi_lock);
 static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
@@ -196,11 +195,7 @@ static struct msi_desc* alloc_msi_entry(void)
 
 static void attach_msi_entry(struct msi_desc *entry, int irq)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
 	msi_desc[irq] = entry;
-	spin_unlock_irqrestore(&msi_lock, flags);
 }
 
 static int create_msi_irq(void)
@@ -683,7 +678,6 @@ void pci_disable_msi(struct pci_dev* dev)
 	struct msi_desc *entry;
 	int pos, default_irq;
 	u16 control;
-	unsigned long flags;
 
 	if (!pci_msi_enable)
 		return;
@@ -704,21 +698,17 @@ void pci_disable_msi(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
 	if (irq_has_action(dev->first_msi_irq)) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
 		       pci_name(dev), dev->first_msi_irq);
 		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
@@ -732,14 +722,11 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	struct msi_desc *entry;
 	int head, entry_nr, type;
 	void __iomem *base;
-	unsigned long flags;
 
 	arch_teardown_msi_irq(irq);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[irq];
 	if (!entry || entry->dev != dev) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return -EINVAL;
 	}
 	type = entry->msi_attrib.type;
@@ -750,7 +737,6 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	msi_desc[entry->link.tail]->link.head = entry->link.head;
 	entry->dev = NULL;
 	msi_desc[irq] = NULL;
-	spin_unlock_irqrestore(&msi_lock, flags);
 
 	destroy_msi_irq(irq);
 
@@ -828,7 +814,6 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 void pci_disable_msix(struct pci_dev* dev)
 {
 	int irq, head, tail = 0, warning = 0;
-	unsigned long flags;
 	int pos;
 	u16 control;
 
@@ -852,9 +837,7 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		spin_lock_irqsave(&msi_lock, flags);
 		tail = msi_desc[irq]->link.tail;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -883,7 +866,6 @@ void pci_disable_msix(struct pci_dev* dev)
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
 	int pos;
-	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
@@ -905,10 +887,8 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
 			base = msi_desc[irq]->mask_base;
-			spin_unlock_irqrestore(&msi_lock, flags);
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 178+ messages in thread

* [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors.
  2007-01-28 19:44       ` Eric W. Biederman
@ 2007-01-28 19:45         ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:45 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


Since msi_remove_pci_irq_vectors is designed to be called during
hotplug remove, it is actively wrong to query the hardware and expect
meaningful results back.

To that end, remove the pci_find_capability calls.  Testing
dev->msi_enabled and dev->msix_enabled gives us all of the information
we need.
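
As a rough illustration of why the hardware query is a problem, here is
a small user-space model.  The toy_ names and the "present" flag are
assumptions made up for this sketch; the real pci_find_capability reads
config space rather than a struct field.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a device that may already have been unplugged. */
struct toy_dev {
	int present;			/* 0 once the card is gone */
	uint8_t msi_cap_offset;		/* capability offset while present */
	unsigned int msi_enabled;	/* software state set at enable time */
};

/* Stand-in for a config-space probe: returns an offset, or 0 if not found. */
static int toy_find_msi_capability(const struct toy_dev *dev)
{
	assert(dev != NULL);
	if (!dev->present)
		return 0;	/* hardware gone, nothing meaningful to read */
	return dev->msi_cap_offset;
}

/* Pre-patch shape: cleanup depends on what the hardware answers. */
static int toy_needs_cleanup_old(const struct toy_dev *dev)
{
	return toy_find_msi_capability(dev) > 0 && dev->msi_enabled;
}

/* Post-patch shape: cleanup depends only on software state. */
static int toy_needs_cleanup_new(const struct toy_dev *dev)
{
	return dev->msi_enabled != 0;
}
```

In this model a device that was removed with MSI still enabled is
skipped by the old check, so its irq would never be released, while the
new check still catches it.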

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 5e7a187..db9c1d7 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -865,13 +865,10 @@ void pci_disable_msix(struct pci_dev* dev)
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos;
-
 	if (!pci_msi_enable || !dev)
  		return;
 
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && dev->msi_enabled) {
+	if (dev->msi_enabled) {
 		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
@@ -880,8 +877,7 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 		} else /* Release MSI irq assigned to this device */
 			msi_free_irq(dev, dev->first_msi_irq);
 	}
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && dev->msix_enabled) {
+	if (dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 178+ messages in thread

* [PATCH 4/6] msi: Remove attach_msi_entry.
  2007-01-28 19:45         ` Eric W. Biederman
@ 2007-01-28 19:47           ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:47 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The attach_msi_entry function has been reduced to a single simple
assignment, so for simplicity remove the abstraction and directly
perform the assignment.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index db9c1d7..b994012 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -193,11 +193,6 @@ static struct msi_desc* alloc_msi_entry(void)
 	return entry;
 }
 
-static void attach_msi_entry(struct msi_desc *entry, int irq)
-{
-	msi_desc[irq] = entry;
-}
-
 static int create_msi_irq(void)
 {
 	struct msi_desc *entry;
@@ -502,7 +497,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	}
 
 	dev->first_msi_irq = irq;
-	attach_msi_entry(entry, irq);
+	msi_desc[irq] = entry;
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -581,7 +576,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			break;
 		}
 
-		attach_msi_entry(entry, irq);
+		msi_desc[irq] = entry;
 	}
 	if (i != nvec) {
 		int avail = i - 1;
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 178+ messages in thread

* [PATCH 5/6] msi: Kill the msi_desc array.
  2007-01-28 19:47           ` Eric W. Biederman
@ 2007-01-28 19:52             ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


We need to be able to get from an irq number to a struct msi_desc.
The msi_desc array in msi.c had several shortcomings; the big one was
that it could not be used outside of msi.c.  Using irq_data in struct
irq_desc almost worked, except that on some architectures irq_data
needs to be used for something else.

So this patch adds a msi_desc pointer to irq_desc, adds the appropriate
wrappers, and changes all of the msi code to use them.

The dynamic_irq_init/cleanup code was tweaked to ensure the new
field is left in a well-defined state.
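
A minimal user-space sketch of the idea follows.  It is not the code
from kernel/irq/chip.c: the toy_ types, the table size and the return
values are assumptions for illustration, and only the wrapper names are
modelled on get_irq_msi/set_irq_msi from the diff below.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_IRQS 8	/* toy value for the sketch */

/* Hypothetical stand-in for struct msi_desc. */
struct toy_msi_desc {
	int entry_nr;
};

/* Hypothetical stand-in for struct irq_desc. */
struct toy_irq_desc {
	void *data;			/* stays free for the architecture */
	struct toy_msi_desc *msi_desc;	/* dedicated slot for MSI state */
};

static struct toy_irq_desc toy_irq_table[TOY_NR_IRQS];

/* Modelled on set_irq_msi: attach (or clear) the descriptor for an irq. */
static int toy_set_irq_msi(unsigned int irq, struct toy_msi_desc *entry)
{
	if (irq >= TOY_NR_IRQS)
		return -1;
	toy_irq_table[irq].msi_desc = entry;
	return 0;
}

/* Modelled on get_irq_msi: look the descriptor up again by irq number. */
static struct toy_msi_desc *toy_get_irq_msi(unsigned int irq)
{
	assert(irq < TOY_NR_IRQS);
	return toy_irq_table[irq].msi_desc;
}
```

Because the MSI pointer has its own slot, setting it leaves the data
pointer alone, which is the property the irq_data approach lacked on
architectures that already use irq_data.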

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/ia64/sn/kernel/msi_sn.c |    2 +-
 drivers/pci/msi.c            |   44 ++++++++++++++++++++---------------------
 include/linux/irq.h          |    4 +++
 kernel/irq/chip.c            |   28 ++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index b3a435f..31fbb85 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -74,7 +74,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index b994012..d7a2259 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
 static int pci_msi_enable = 1;
@@ -43,7 +42,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 {
 	struct msi_desc *entry;
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
@@ -73,7 +72,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 
 void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch(entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -112,7 +111,7 @@ void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 
 void write_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -208,7 +207,7 @@ static int create_msi_irq(void)
 		return -EBUSY;
 	}
 
-	set_irq_data(irq, entry);
+	set_irq_msi(irq, entry);
 
 	return irq;
 }
@@ -217,9 +216,9 @@ static void destroy_msi_irq(unsigned int irq)
 {
 	struct msi_desc *entry;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	set_irq_chip(irq, NULL);
-	set_irq_data(irq, NULL);
+	set_irq_msi(irq, NULL);
 	destroy_irq(irq);
 	kmem_cache_free(msi_cachep, entry);
 }
@@ -371,10 +370,10 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	while (head != tail) {
 		struct msi_desc *entry;
 
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		read_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -421,10 +420,10 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	/* route the table */
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		write_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -462,7 +461,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	if (irq < 0)
 		return irq;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	entry->link.head = irq;
 	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
@@ -497,7 +496,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	}
 
 	dev->first_msi_irq = irq;
-	msi_desc[irq] = entry;
+	set_irq_msi(irq, entry);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -546,7 +545,7 @@ static int msix_capability_init(struct pci_dev *dev,
 		if (irq < 0)
 			break;
 
-		entry = get_irq_data(irq);
+		entry = get_irq_msi(irq);
  		j = entries[i].entry;
  		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
@@ -576,7 +575,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			break;
 		}
 
-		msi_desc[irq] = entry;
+		set_irq_msi(irq, entry);
 	}
 	if (i != nvec) {
 		int avail = i - 1;
@@ -693,7 +692,7 @@ void pci_disable_msi(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	entry = msi_desc[dev->first_msi_irq];
+	entry = get_irq_msi(dev->first_msi_irq);
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		return;
 	}
@@ -720,7 +719,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 
 	arch_teardown_msi_irq(irq);
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
 	}
@@ -728,10 +727,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	entry_nr = entry->msi_attrib.entry_nr;
 	head = entry->link.head;
 	base = entry->mask_base;
-	msi_desc[entry->link.head]->link.tail = entry->link.tail;
-	msi_desc[entry->link.tail]->link.head = entry->link.head;
+	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
+	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
 	entry->dev = NULL;
-	msi_desc[irq] = NULL;
 
 	destroy_msi_irq(irq);
 
@@ -832,7 +830,7 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		tail = msi_desc[irq]->link.tail;
+		tail = get_irq_msi(irq)->link.tail;
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -878,8 +876,8 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			tail = msi_desc[irq]->link.tail;
-			base = msi_desc[irq]->mask_base;
+			tail = get_irq_msi(irq)->link.tail;
+			base = get_irq_msi(irq)->mask_base;
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 52fc405..5504b67 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -68,6 +68,7 @@ typedef	void fastcall (*irq_flow_handler_t)(unsigned int irq,
 #define IRQ_MOVE_PENDING	0x40000000	/* need to re-target IRQ destination */
 
 struct proc_dir_entry;
+struct msi_desc;
 
 /**
  * struct irq_chip - hardware interrupt chip descriptor
@@ -148,6 +149,7 @@ struct irq_chip {
 struct irq_desc {
 	irq_flow_handler_t	handle_irq;
 	struct irq_chip		*chip;
+	struct msi_desc		*msi_desc;
 	void			*handler_data;
 	void			*chip_data;
 	struct irqaction	*action;	/* IRQ action list */
@@ -373,10 +375,12 @@ extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
 extern int set_irq_data(unsigned int irq, void *data);
 extern int set_irq_chip_data(unsigned int irq, void *data);
 extern int set_irq_type(unsigned int irq, unsigned int type);
+extern int set_irq_msi(unsigned int irq, struct msi_desc *entry);
 
 #define get_irq_chip(irq)	(irq_desc[irq].chip)
 #define get_irq_chip_data(irq)	(irq_desc[irq].chip_data)
 #define get_irq_data(irq)	(irq_desc[irq].handler_data)
+#define get_irq_msi(irq)	(irq_desc[irq].msi_desc)
 
 #endif /* CONFIG_GENERIC_HARDIRQS */
 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index d27b258..475e8a7 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -39,6 +39,7 @@ void dynamic_irq_init(unsigned int irq)
 	desc->chip = &no_irq_chip;
 	desc->handle_irq = handle_bad_irq;
 	desc->depth = 1;
+	desc->msi_desc = NULL;
 	desc->handler_data = NULL;
 	desc->chip_data = NULL;
 	desc->action = NULL;
@@ -74,6 +75,9 @@ void dynamic_irq_cleanup(unsigned int irq)
 		WARN_ON(1);
 		return;
 	}
+	desc->msi_desc = NULL;
+	desc->handler_data = NULL;
+	desc->chip_data = NULL;
 	desc->handle_irq = handle_bad_irq;
 	desc->chip = &no_irq_chip;
 	spin_unlock_irqrestore(&desc->lock, flags);
@@ -162,6 +166,30 @@ int set_irq_data(unsigned int irq, void *data)
 EXPORT_SYMBOL(set_irq_data);
 
 /**
+ *	set_irq_msi - set the MSI descriptor for an irq
+ *	@irq:	Interrupt number
+ *	@entry:	Pointer to MSI descriptor data
+ *
+ *	Set the MSI descriptor pointer (msi_desc) for an irq
+ */
+int set_irq_msi(unsigned int irq, struct msi_desc *entry)
+{
+	struct irq_desc *desc;
+	unsigned long flags;
+
+	if (irq >= NR_IRQS) {
+		printk(KERN_ERR
+		       "Trying to install msi data for IRQ%d\n", irq);
+		return -EINVAL;
+	}
+	desc = irq_desc + irq;
+	spin_lock_irqsave(&desc->lock, flags);
+	desc->msi_desc = entry;
+	spin_unlock_irqrestore(&desc->lock, flags);
+	return 0;
+}
+
+/**
  *	set_irq_chip_data - set irq chip data for an irq
  *	@irq:	Interrupt number
  *	@data:	Pointer to chip specific data
-- 
1.4.4.1.g278f



* [PATCH 5/6] msi: Kill the msi_desc array.
@ 2007-01-28 19:52             ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Tony Luck, Grant Grundler, Ingo Molnar, linux-kernel,
	Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller


We need to be able to get from an irq number to a struct msi_desc.
The msi_desc array in msi.c had several shortcomings; the big one was
that it could not be used outside of msi.c.  Using irq_data in struct
irq_desc almost worked, except that on some architectures irq_data
needs to be used for something else.

So this patch adds a msi_desc pointer to irq_desc, adds the appropriate
wrappers and changes all of the msi code to use them.

The dynamic_irq_init/cleanup code was tweaked to ensure the new
field is left in a well defined state.
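
As an illustration only, the lookup described above can be modelled in
userspace roughly as follows.  This is a minimal sketch, not the kernel
code: the table size, the reduced struct layouts and the missing
locking are assumptions made for the example, and the plain -1 error
return stands in for the -EINVAL used in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS 16	/* assumption: small table for the sketch */

/* Stand-in for the real descriptor; only one field is modelled. */
struct msi_desc {
	int entry_nr;
};

/* Reduced irq_desc: the dedicated MSI slot sits next to handler_data. */
struct irq_desc {
	struct msi_desc *msi_desc;	/* slot added by this patch */
	void *handler_data;		/* stays free for the architecture */
};

static struct irq_desc irq_desc[NR_IRQS];

/* Same shape as the accessor macro added to include/linux/irq.h. */
#define get_irq_msi(irq)	(irq_desc[irq].msi_desc)

/*
 * Bounds-checked setter, mirroring the check in the patch.
 * The desc->lock handling of the real function is left out here.
 */
static int set_irq_msi(unsigned int irq, struct msi_desc *entry)
{
	if (irq >= NR_IRQS)
		return -1;	/* the patch returns -EINVAL */
	irq_desc[irq].msi_desc = entry;
	return 0;
}
```

The point of the separate field is that an architecture can keep using
handler_data for its own purposes while the MSI code, and code outside
msi.c, can still find the descriptor from the irq number.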

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/ia64/sn/kernel/msi_sn.c |    2 +-
 drivers/pci/msi.c            |   44 ++++++++++++++++++++---------------------
 include/linux/irq.h          |    4 +++
 kernel/irq/chip.c            |   28 ++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index b3a435f..31fbb85 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -74,7 +74,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index b994012..d7a2259 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
 static int pci_msi_enable = 1;
@@ -43,7 +42,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 {
 	struct msi_desc *entry;
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
@@ -73,7 +72,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 
 void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch(entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -112,7 +111,7 @@ void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 
 void write_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -208,7 +207,7 @@ static int create_msi_irq(void)
 		return -EBUSY;
 	}
 
-	set_irq_data(irq, entry);
+	set_irq_msi(irq, entry);
 
 	return irq;
 }
@@ -217,9 +216,9 @@ static void destroy_msi_irq(unsigned int irq)
 {
 	struct msi_desc *entry;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	set_irq_chip(irq, NULL);
-	set_irq_data(irq, NULL);
+	set_irq_msi(irq, NULL);
 	destroy_irq(irq);
 	kmem_cache_free(msi_cachep, entry);
 }
@@ -371,10 +370,10 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	while (head != tail) {
 		struct msi_desc *entry;
 
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		read_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -421,10 +420,10 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	/* route the table */
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		write_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -462,7 +461,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	if (irq < 0)
 		return irq;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	entry->link.head = irq;
 	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
@@ -497,7 +496,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	}
 
 	dev->first_msi_irq = irq;
-	msi_desc[irq] = entry;
+	set_irq_msi(irq, entry);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -546,7 +545,7 @@ static int msix_capability_init(struct pci_dev *dev,
 		if (irq < 0)
 			break;
 
-		entry = get_irq_data(irq);
+		entry = get_irq_msi(irq);
  		j = entries[i].entry;
  		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
@@ -576,7 +575,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			break;
 		}
 
-		msi_desc[irq] = entry;
+		set_irq_msi(irq, entry);
 	}
 	if (i != nvec) {
 		int avail = i - 1;
@@ -693,7 +692,7 @@ void pci_disable_msi(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	entry = msi_desc[dev->first_msi_irq];
+	entry = get_irq_msi(dev->first_msi_irq);
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		return;
 	}
@@ -720,7 +719,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 
 	arch_teardown_msi_irq(irq);
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
 	}
@@ -728,10 +727,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	entry_nr = entry->msi_attrib.entry_nr;
 	head = entry->link.head;
 	base = entry->mask_base;
-	msi_desc[entry->link.head]->link.tail = entry->link.tail;
-	msi_desc[entry->link.tail]->link.head = entry->link.head;
+	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
+	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
 	entry->dev = NULL;
-	msi_desc[irq] = NULL;
 
 	destroy_msi_irq(irq);
 
@@ -832,7 +830,7 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		tail = msi_desc[irq]->link.tail;
+		tail = get_irq_msi(irq)->link.tail;
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -878,8 +876,8 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			tail = msi_desc[irq]->link.tail;
-			base = msi_desc[irq]->mask_base;
+			tail = get_irq_msi(irq)->link.tail;
+			base = get_irq_msi(irq)->mask_base;
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 52fc405..5504b67 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -68,6 +68,7 @@ typedef	void fastcall (*irq_flow_handler_t)(unsigned int irq,
 #define IRQ_MOVE_PENDING	0x40000000	/* need to re-target IRQ destination */
 
 struct proc_dir_entry;
+struct msi_desc;
 
 /**
  * struct irq_chip - hardware interrupt chip descriptor
@@ -148,6 +149,7 @@ struct irq_chip {
 struct irq_desc {
 	irq_flow_handler_t	handle_irq;
 	struct irq_chip		*chip;
+	struct msi_desc		*msi_desc;
 	void			*handler_data;
 	void			*chip_data;
 	struct irqaction	*action;	/* IRQ action list */
@@ -373,10 +375,12 @@ extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
 extern int set_irq_data(unsigned int irq, void *data);
 extern int set_irq_chip_data(unsigned int irq, void *data);
 extern int set_irq_type(unsigned int irq, unsigned int type);
+extern int set_irq_msi(unsigned int irq, struct msi_desc *entry);
 
 #define get_irq_chip(irq)	(irq_desc[irq].chip)
 #define get_irq_chip_data(irq)	(irq_desc[irq].chip_data)
 #define get_irq_data(irq)	(irq_desc[irq].handler_data)
+#define get_irq_msi(irq)	(irq_desc[irq].msi_desc)
 
 #endif /* CONFIG_GENERIC_HARDIRQS */
 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index d27b258..475e8a7 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -39,6 +39,7 @@ void dynamic_irq_init(unsigned int irq)
 	desc->chip = &no_irq_chip;
 	desc->handle_irq = handle_bad_irq;
 	desc->depth = 1;
+	desc->msi_desc = NULL;
 	desc->handler_data = NULL;
 	desc->chip_data = NULL;
 	desc->action = NULL;
@@ -74,6 +75,9 @@ void dynamic_irq_cleanup(unsigned int irq)
 		WARN_ON(1);
 		return;
 	}
+	desc->msi_desc = NULL;
+	desc->handler_data = NULL;
+	desc->chip_data = NULL;
 	desc->handle_irq = handle_bad_irq;
 	desc->chip = &no_irq_chip;
 	spin_unlock_irqrestore(&desc->lock, flags);
@@ -162,6 +166,30 @@ int set_irq_data(unsigned int irq, void *data)
 EXPORT_SYMBOL(set_irq_data);
 
 /**
+ *	set_irq_msi - set the MSI descriptor for an irq
+ *	@irq:	Interrupt number
+ *	@entry:	Pointer to MSI descriptor data
+ *
+ *	Set the MSI descriptor pointer (msi_desc) for an irq
+ */
+int set_irq_msi(unsigned int irq, struct msi_desc *entry)
+{
+	struct irq_desc *desc;
+	unsigned long flags;
+
+	if (irq >= NR_IRQS) {
+		printk(KERN_ERR
+		       "Trying to install msi data for IRQ%d\n", irq);
+		return -EINVAL;
+	}
+	desc = irq_desc + irq;
+	spin_lock_irqsave(&desc->lock, flags);
+	desc->msi_desc = entry;
+	spin_unlock_irqrestore(&desc->lock, flags);
+	return 0;
+}
+
+/**
  *	set_irq_chip_data - set irq chip data for an irq
  *	@irq:	Interrupt number
  *	@data:	Pointer to chip specific data
-- 
1.4.4.1.g278f


* [PATCH 6/6] msi: Make MSI useable more architectures
  2007-01-28 19:52             ` Eric W. Biederman
@ 2007-01-28 19:56               ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The arch hooks arch_setup_msi_irq and arch_teardown_msi_irq are now
responsible for allocating and freeing the linux irq, in addition to
setting up the linux irq to work with the interrupt.

arch_setup_msi_irq now takes a struct pci_dev and a struct msi_desc and
returns the allocated irq, or a negative errno on failure.

With this change in place this code should be usable by all platforms
except those that won't let the OS touch the hardware, like ppc RTAS.
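
The new division of labour can be sketched in userspace as below.  This
is a hedged model, not the kernel code: create_irq() and destroy_irq()
are toy allocators, irq_to_desc[] stands in for the msi_desc pointer in
irq_desc, calloc()/free() stand in for the msi_cachep slab, and
msi_enable_one()/msi_disable_one() are hypothetical names for the
generic caller side.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define NR_IRQS 4	/* assumption: tiny irq space, irq 0 unused */

struct pci_dev { int id; };
struct msi_desc { int head, tail; struct pci_dev *dev; };

static struct msi_desc *irq_to_desc[NR_IRQS];	/* models irq_desc[].msi_desc */
static int irq_used[NR_IRQS];

/* Toy stand-in for create_irq(): first free irq, or a negative errno. */
static int create_irq(void)
{
	int irq;

	for (irq = 1; irq < NR_IRQS; irq++) {
		if (!irq_used[irq]) {
			irq_used[irq] = 1;
			return irq;
		}
	}
	return -ENOSPC;
}

static void destroy_irq(int irq)
{
	irq_used[irq] = 0;
	irq_to_desc[irq] = NULL;
}

/* New-style hook: allocate the irq, bind the descriptor, return the irq. */
static int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
{
	int irq = create_irq();	/* signed, so the error check can fire */

	(void)dev;	/* a real hook composes and writes the MSI message */
	if (irq < 0)
		return irq;
	irq_to_desc[irq] = desc;
	return irq;
}

static void arch_teardown_msi_irq(int irq)
{
	destroy_irq(irq);
}

/* Generic side: owns the descriptor, frees it if the arch hook fails. */
static int msi_enable_one(struct pci_dev *dev)
{
	struct msi_desc *entry = calloc(1, sizeof(*entry));
	int irq;

	if (!entry)
		return -ENOMEM;
	entry->dev = dev;
	irq = arch_setup_msi_irq(dev, entry);
	if (irq < 0) {
		free(entry);
		return irq;
	}
	entry->head = entry->tail = irq;
	return irq;
}

/* Look the descriptor up first, then release the irq, then the memory. */
static void msi_disable_one(int irq)
{
	struct msi_desc *entry = irq_to_desc[irq];

	arch_teardown_msi_irq(irq);
	free(entry);
}
```

Note that the irq is held in a signed int on the arch side; with an
unsigned variable the "irq < 0" test after create_irq() could never
be true.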

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/i386/kernel/io_apic.c   |   17 ++++++---
 arch/ia64/kernel/msi_ia64.c  |   19 ++++++----
 arch/ia64/sn/kernel/msi_sn.c |   20 +++++++---
 arch/x86_64/kernel/io_apic.c |   17 ++++++---
 drivers/pci/msi.c            |   80 +++++++++++------------------------------
 include/asm-ia64/machvec.h   |    3 +-
 include/linux/msi.h          |    2 +-
 7 files changed, 75 insertions(+), 83 deletions(-)

diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index 2424cc9..9ba4f99 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -2600,25 +2600,32 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq,
 				      "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/arch/ia64/kernel/msi_ia64.c b/arch/ia64/kernel/msi_ia64.c
index 822e59a..0d05450 100644
--- a/arch/ia64/kernel/msi_ia64.c
+++ b/arch/ia64/kernel/msi_ia64.c
@@ -64,12 +64,17 @@ static void ia64_set_msi_irq_affinity(unsigned int irq, cpumask_t cpu_mask)
 }
 #endif /* CONFIG_SMP */
 
-int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int ia64_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	struct msi_msg	msg;
 	unsigned long	dest_phys_id;
-	unsigned int	vector;
+	int		irq, vector;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	dest_phys_id = cpu_physical_id(first_cpu(cpu_online_map));
 	vector = irq;
 
@@ -89,12 +94,12 @@ int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &ia64_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 void ia64_teardown_msi_irq(unsigned int irq)
 {
-	return;		/* no-op */
+	destroy_irq(irq);
 }
 
 static void ia64_ack_msi_irq(unsigned int irq)
@@ -126,12 +131,12 @@ static struct irq_chip ia64_msi_chip = {
 };
 
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	if (platform_setup_msi_irq)
-		return platform_setup_msi_irq(irq, pdev);
+		return platform_setup_msi_irq(pdev, desc);
 
-	return ia64_setup_msi_irq(irq, pdev);
+	return ia64_setup_msi_irq(pdev, desc);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index 31fbb85..ea3dc38 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -59,13 +59,12 @@ void sn_teardown_msi_irq(unsigned int irq)
 	sn_intr_free(nasid, widget, sn_irq_info);
 	sn_msi_info[irq].sn_irq_info = NULL;
 
-	return;
+	destroy_irq(irq);
 }
 
-int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int sn_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *entry)
 {
 	struct msi_msg msg;
-	struct msi_desc *entry;
 	int widget;
 	int status;
 	nasid_t nasid;
@@ -73,8 +72,8 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct sn_irq_info *sn_irq_info;
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+	int irq;
 
-	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
@@ -84,6 +83,11 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (provider == NULL || provider->dma_map_consistent == NULL)
 		return -EINVAL;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, entry);
 	/*
 	 * Set up the vector plumbing.  Let the prom (via sn_intr_alloc)
 	 * decide which cpu to direct this msi at by default.
@@ -95,12 +99,15 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 			SWIN_WIDGETNUM(bussoft->bs_base);
 
 	sn_irq_info = kzalloc(sizeof(struct sn_irq_info), GFP_KERNEL);
-	if (! sn_irq_info)
+	if (! sn_irq_info) {
+		destroy_irq(irq);
 		return -ENOMEM;
+	}
 
 	status = sn_intr_alloc(nasid, widget, sn_irq_info, irq, -1, -1);
 	if (status) {
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -121,6 +128,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (! bus_addr) {
 		sn_intr_free(nasid, widget, sn_irq_info);
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -139,7 +147,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &sn_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d7bad90..6be6730 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -1956,24 +1956,31 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d7a2259..c6a6d46 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -192,37 +192,6 @@ static struct msi_desc* alloc_msi_entry(void)
 	return entry;
 }
 
-static int create_msi_irq(void)
-{
-	struct msi_desc *entry;
-	int irq;
-
-	entry = alloc_msi_entry();
-	if (!entry)
-		return -ENOMEM;
-
-	irq = create_irq();
-	if (irq < 0) {
-		kmem_cache_free(msi_cachep, entry);
-		return -EBUSY;
-	}
-
-	set_irq_msi(irq, entry);
-
-	return irq;
-}
-
-static void destroy_msi_irq(unsigned int irq)
-{
-	struct msi_desc *entry;
-
-	entry = get_irq_msi(irq);
-	set_irq_chip(irq, NULL);
-	set_irq_msi(irq, NULL);
-	destroy_irq(irq);
-	kmem_cache_free(msi_cachep, entry);
-}
-
 static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
 {
 	u16 control;
@@ -449,7 +418,6 @@ void pci_restore_msi_state(struct pci_dev *dev)
  **/
 static int msi_capability_init(struct pci_dev *dev)
 {
-	int status;
 	struct msi_desc *entry;
 	int pos, irq;
 	u16 control;
@@ -457,13 +425,10 @@ static int msi_capability_init(struct pci_dev *dev)
    	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	pci_read_config_word(dev, msi_control_reg(pos), &control);
 	/* MSI Entry Initialization */
-	irq = create_msi_irq();
-	if (irq < 0)
-		return irq;
+	entry = alloc_msi_entry();
+	if (!entry)
+		return -ENOMEM;
 
-	entry = get_irq_msi(irq);
-	entry->link.head = irq;
-	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
 	entry->msi_attrib.is_64 = is_64bit_address(control);
 	entry->msi_attrib.entry_nr = 0;
@@ -489,14 +454,16 @@ static int msi_capability_init(struct pci_dev *dev)
 			maskbits);
 	}
 	/* Configure MSI capability structure */
-	status = arch_setup_msi_irq(irq, dev);
-	if (status < 0) {
-		destroy_msi_irq(irq);
-		return status;
+	irq = arch_setup_msi_irq(dev, entry);
+	if (irq < 0) {
+		kmem_cache_free(msi_cachep, entry);
+		return irq;
 	}
-
+	entry->link.head = irq;
+	entry->link.tail = irq;
 	dev->first_msi_irq = irq;
 	set_irq_msi(irq, entry);
+
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -518,7 +485,6 @@ static int msix_capability_init(struct pci_dev *dev,
 				struct msix_entry *entries, int nvec)
 {
 	struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
-	int status;
 	int irq, pos, i, j, nr_entries, temp = 0;
 	unsigned long phys_addr;
 	u32 table_offset;
@@ -541,13 +507,11 @@ static int msix_capability_init(struct pci_dev *dev,
 
 	/* MSI-X Table Initialization */
 	for (i = 0; i < nvec; i++) {
-		irq = create_msi_irq();
-		if (irq < 0)
+		entry = alloc_msi_entry();
+		if (!entry)
 			break;
 
-		entry = get_irq_msi(irq);
  		j = entries[i].entry;
- 		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
 		entry->msi_attrib.is_64 = 1;
 		entry->msi_attrib.entry_nr = j;
@@ -556,6 +520,14 @@ static int msix_capability_init(struct pci_dev *dev,
 		entry->msi_attrib.pos = pos;
 		entry->dev = dev;
 		entry->mask_base = base;
+
+		/* Configure MSI-X capability structure */
+		irq = arch_setup_msi_irq(dev, entry);
+		if (irq < 0) {
+			kmem_cache_free(msi_cachep, entry);
+			break;
+		}
+ 		entries[i].vector = irq;
 		if (!head) {
 			entry->link.head = irq;
 			entry->link.tail = irq;
@@ -568,12 +540,6 @@ static int msix_capability_init(struct pci_dev *dev,
 		}
 		temp = irq;
 		tail = entry;
-		/* Configure MSI-X capability structure */
-		status = arch_setup_msi_irq(irq, dev);
-		if (status < 0) {
-			destroy_msi_irq(irq);
-			break;
-		}
 
 		set_irq_msi(irq, entry);
 	}
@@ -717,8 +683,6 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	int head, entry_nr, type;
 	void __iomem *base;
 
-	arch_teardown_msi_irq(irq);
-
 	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
@@ -729,9 +693,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	base = entry->mask_base;
 	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
 	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
-	entry->dev = NULL;
 
-	destroy_msi_irq(irq);
+	arch_teardown_msi_irq(irq);
+	kmem_cache_free(msi_cachep, entry);
 
 	if (type == PCI_CAP_ID_MSIX) {
 		writel(1, base + entry_nr * PCI_MSIX_ENTRY_SIZE +
diff --git a/include/asm-ia64/machvec.h b/include/asm-ia64/machvec.h
index a3891eb..3c96ac1 100644
--- a/include/asm-ia64/machvec.h
+++ b/include/asm-ia64/machvec.h
@@ -21,6 +21,7 @@ struct mm_struct;
 struct pci_bus;
 struct task_struct;
 struct pci_dev;
+struct msi_desc;
 
 typedef void ia64_mv_setup_t (char **);
 typedef void ia64_mv_cpu_init_t (void);
@@ -79,7 +80,7 @@ typedef unsigned short ia64_mv_readw_relaxed_t (const volatile void __iomem *);
 typedef unsigned int ia64_mv_readl_relaxed_t (const volatile void __iomem *);
 typedef unsigned long ia64_mv_readq_relaxed_t (const volatile void __iomem *);
 
-typedef int ia64_mv_setup_msi_irq_t (unsigned int irq, struct pci_dev *pdev);
+typedef int ia64_mv_setup_msi_irq_t (struct pci_dev *pdev, struct msi_desc *);
 typedef void ia64_mv_teardown_msi_irq_t (unsigned int irq);
 
 static inline void
diff --git a/include/linux/msi.h b/include/linux/msi.h
index b99976b..74c8a2e 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -41,7 +41,7 @@ struct msi_desc {
 /*
  * The arch hook for setup up msi irqs
  */
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev);
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
 
 
-- 
1.4.4.1.g278f



* [PATCH 6/6] msi: Make MSI useable more architectures
@ 2007-01-28 19:56               ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Tony Luck, Grant Grundler, Ingo Molnar, linux-kernel,
	Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller


The arch hooks arch_setup_msi_irq and arch_teardown_msi_irq are now
responsible for allocating and freeing the linux irq, in addition to
setting up the linux irq to work with the interrupt.

arch_setup_msi_irq now takes a struct pci_dev and a struct msi_desc and
returns the allocated irq, or a negative errno on failure.

With this change in place this code should be usable by all platforms
except those that won't let the OS touch the hardware, like ppc RTAS.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/i386/kernel/io_apic.c   |   17 ++++++---
 arch/ia64/kernel/msi_ia64.c  |   19 ++++++----
 arch/ia64/sn/kernel/msi_sn.c |   20 +++++++---
 arch/x86_64/kernel/io_apic.c |   17 ++++++---
 drivers/pci/msi.c            |   80 +++++++++++------------------------------
 include/asm-ia64/machvec.h   |    3 +-
 include/linux/msi.h          |    2 +-
 7 files changed, 75 insertions(+), 83 deletions(-)

diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index 2424cc9..9ba4f99 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -2600,25 +2600,32 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq,
 				      "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/arch/ia64/kernel/msi_ia64.c b/arch/ia64/kernel/msi_ia64.c
index 822e59a..0d05450 100644
--- a/arch/ia64/kernel/msi_ia64.c
+++ b/arch/ia64/kernel/msi_ia64.c
@@ -64,12 +64,17 @@ static void ia64_set_msi_irq_affinity(unsigned int irq, cpumask_t cpu_mask)
 }
 #endif /* CONFIG_SMP */
 
-int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int ia64_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	struct msi_msg	msg;
 	unsigned long	dest_phys_id;
-	unsigned int	vector;
+	int		irq, vector;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	dest_phys_id = cpu_physical_id(first_cpu(cpu_online_map));
 	vector = irq;
 
@@ -89,12 +94,12 @@ int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &ia64_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 void ia64_teardown_msi_irq(unsigned int irq)
 {
-	return;		/* no-op */
+	destroy_irq(irq);
 }
 
 static void ia64_ack_msi_irq(unsigned int irq)
@@ -126,12 +131,12 @@ static struct irq_chip ia64_msi_chip = {
 };
 
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	if (platform_setup_msi_irq)
-		return platform_setup_msi_irq(irq, pdev);
+		return platform_setup_msi_irq(pdev, desc);
 
-	return ia64_setup_msi_irq(irq, pdev);
+	return ia64_setup_msi_irq(pdev, desc);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index 31fbb85..ea3dc38 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -59,13 +59,12 @@ void sn_teardown_msi_irq(unsigned int irq)
 	sn_intr_free(nasid, widget, sn_irq_info);
 	sn_msi_info[irq].sn_irq_info = NULL;
 
-	return;
+	destroy_irq(irq);
 }
 
-int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int sn_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *entry)
 {
 	struct msi_msg msg;
-	struct msi_desc *entry;
 	int widget;
 	int status;
 	nasid_t nasid;
@@ -73,8 +72,8 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct sn_irq_info *sn_irq_info;
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+	int irq;
 
-	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
@@ -84,6 +83,11 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (provider == NULL || provider->dma_map_consistent == NULL)
 		return -EINVAL;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, entry);
 	/*
 	 * Set up the vector plumbing.  Let the prom (via sn_intr_alloc)
 	 * decide which cpu to direct this msi at by default.
@@ -95,12 +99,15 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 			SWIN_WIDGETNUM(bussoft->bs_base);
 
 	sn_irq_info = kzalloc(sizeof(struct sn_irq_info), GFP_KERNEL);
-	if (! sn_irq_info)
+	if (! sn_irq_info) {
+		destroy_irq(irq);
 		return -ENOMEM;
+	}
 
 	status = sn_intr_alloc(nasid, widget, sn_irq_info, irq, -1, -1);
 	if (status) {
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -121,6 +128,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (! bus_addr) {
 		sn_intr_free(nasid, widget, sn_irq_info);
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -139,7 +147,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &sn_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d7bad90..6be6730 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -1956,24 +1956,31 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d7a2259..c6a6d46 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -192,37 +192,6 @@ static struct msi_desc* alloc_msi_entry(void)
 	return entry;
 }
 
-static int create_msi_irq(void)
-{
-	struct msi_desc *entry;
-	int irq;
-
-	entry = alloc_msi_entry();
-	if (!entry)
-		return -ENOMEM;
-
-	irq = create_irq();
-	if (irq < 0) {
-		kmem_cache_free(msi_cachep, entry);
-		return -EBUSY;
-	}
-
-	set_irq_msi(irq, entry);
-
-	return irq;
-}
-
-static void destroy_msi_irq(unsigned int irq)
-{
-	struct msi_desc *entry;
-
-	entry = get_irq_msi(irq);
-	set_irq_chip(irq, NULL);
-	set_irq_msi(irq, NULL);
-	destroy_irq(irq);
-	kmem_cache_free(msi_cachep, entry);
-}
-
 static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
 {
 	u16 control;
@@ -449,7 +418,6 @@ void pci_restore_msi_state(struct pci_dev *dev)
  **/
 static int msi_capability_init(struct pci_dev *dev)
 {
-	int status;
 	struct msi_desc *entry;
 	int pos, irq;
 	u16 control;
@@ -457,13 +425,10 @@ static int msi_capability_init(struct pci_dev *dev)
    	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	pci_read_config_word(dev, msi_control_reg(pos), &control);
 	/* MSI Entry Initialization */
-	irq = create_msi_irq();
-	if (irq < 0)
-		return irq;
+	entry = alloc_msi_entry();
+	if (!entry)
+		return -ENOMEM;
 
-	entry = get_irq_msi(irq);
-	entry->link.head = irq;
-	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
 	entry->msi_attrib.is_64 = is_64bit_address(control);
 	entry->msi_attrib.entry_nr = 0;
@@ -489,14 +454,16 @@ static int msi_capability_init(struct pci_dev *dev)
 			maskbits);
 	}
 	/* Configure MSI capability structure */
-	status = arch_setup_msi_irq(irq, dev);
-	if (status < 0) {
-		destroy_msi_irq(irq);
-		return status;
+	irq = arch_setup_msi_irq(dev, entry);
+	if (irq < 0) {
+		kmem_cache_free(msi_cachep, entry);
+		return irq;
 	}
-
+	entry->link.head = irq;
+	entry->link.tail = irq;
 	dev->first_msi_irq = irq;
 	set_irq_msi(irq, entry);
+
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -518,7 +485,6 @@ static int msix_capability_init(struct pci_dev *dev,
 				struct msix_entry *entries, int nvec)
 {
 	struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
-	int status;
 	int irq, pos, i, j, nr_entries, temp = 0;
 	unsigned long phys_addr;
 	u32 table_offset;
@@ -541,13 +507,11 @@ static int msix_capability_init(struct pci_dev *dev,
 
 	/* MSI-X Table Initialization */
 	for (i = 0; i < nvec; i++) {
-		irq = create_msi_irq();
-		if (irq < 0)
+		entry = alloc_msi_entry();
+		if (!entry)
 			break;
 
-		entry = get_irq_msi(irq);
  		j = entries[i].entry;
- 		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
 		entry->msi_attrib.is_64 = 1;
 		entry->msi_attrib.entry_nr = j;
@@ -556,6 +520,14 @@ static int msix_capability_init(struct pci_dev *dev,
 		entry->msi_attrib.pos = pos;
 		entry->dev = dev;
 		entry->mask_base = base;
+
+		/* Configure MSI-X capability structure */
+		irq = arch_setup_msi_irq(dev, entry);
+		if (irq < 0) {
+			kmem_cache_free(msi_cachep, entry);
+			break;
+		}
+ 		entries[i].vector = irq;
 		if (!head) {
 			entry->link.head = irq;
 			entry->link.tail = irq;
@@ -568,12 +540,6 @@ static int msix_capability_init(struct pci_dev *dev,
 		}
 		temp = irq;
 		tail = entry;
-		/* Configure MSI-X capability structure */
-		status = arch_setup_msi_irq(irq, dev);
-		if (status < 0) {
-			destroy_msi_irq(irq);
-			break;
-		}
 
 		set_irq_msi(irq, entry);
 	}
@@ -717,8 +683,6 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	int head, entry_nr, type;
 	void __iomem *base;
 
-	arch_teardown_msi_irq(irq);
-
 	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
@@ -729,9 +693,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	base = entry->mask_base;
 	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
 	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
-	entry->dev = NULL;
 
-	destroy_msi_irq(irq);
+	arch_teardown_msi_irq(irq);
+	kmem_cache_free(msi_cachep, entry);
 
 	if (type == PCI_CAP_ID_MSIX) {
 		writel(1, base + entry_nr * PCI_MSIX_ENTRY_SIZE +
diff --git a/include/asm-ia64/machvec.h b/include/asm-ia64/machvec.h
index a3891eb..3c96ac1 100644
--- a/include/asm-ia64/machvec.h
+++ b/include/asm-ia64/machvec.h
@@ -21,6 +21,7 @@ struct mm_struct;
 struct pci_bus;
 struct task_struct;
 struct pci_dev;
+struct msi_desc;
 
 typedef void ia64_mv_setup_t (char **);
 typedef void ia64_mv_cpu_init_t (void);
@@ -79,7 +80,7 @@ typedef unsigned short ia64_mv_readw_relaxed_t (const volatile void __iomem *);
 typedef unsigned int ia64_mv_readl_relaxed_t (const volatile void __iomem *);
 typedef unsigned long ia64_mv_readq_relaxed_t (const volatile void __iomem *);
 
-typedef int ia64_mv_setup_msi_irq_t (unsigned int irq, struct pci_dev *pdev);
+typedef int ia64_mv_setup_msi_irq_t (struct pci_dev *pdev, struct msi_desc *);
 typedef void ia64_mv_teardown_msi_irq_t (unsigned int irq);
 
 static inline void
diff --git a/include/linux/msi.h b/include/linux/msi.h
index b99976b..74c8a2e 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -41,7 +41,7 @@ struct msi_desc {
 /*
  * The arch hook for setup up msi irqs
  */
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev);
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
 
 
-- 
1.4.4.1.g278f

^ permalink raw reply related	[flat|nested] 178+ messages in thread
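
To make the calling convention in the patch above concrete, here is a minimal user-space model of it, not kernel code. The struct layouts, the irq counter and the fail_setup knob are stand-ins invented for illustration. Only the shape of the contract is taken from the diff: the arch hook now creates the irq itself and returns it (or undoes its own work and returns a negative errno), while the generic caller owns the msi_desc and frees it when the hook fails.

```c
#include <errno.h>
#include <stdlib.h>

/* Stand-ins for the kernel types; fields are illustrative only. */
struct msi_desc {
	int head, tail;
};

struct pci_dev {
	int first_msi_irq;
	int fail_setup;          /* test knob: make the arch hook fail */
	struct msi_desc *desc;   /* stands in for set_irq_msi() bookkeeping */
};

static int next_irq = 16;   /* pretend irq number allocator */
static int live_irqs;       /* irqs created and not yet destroyed */

static int create_irq(void)
{
	live_irqs++;
	return next_irq++;
}

static void destroy_irq(int irq)
{
	(void)irq;
	live_irqs--;
}

/* New-style hook: allocates the irq and returns it, or cleans up and
 * returns a negative errno. */
static int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
{
	int irq = create_irq();

	(void)desc;
	if (irq < 0)
		return irq;
	if (dev->fail_setup) {   /* models a failure after the irq exists */
		destroy_irq(irq);
		return -EINVAL;
	}
	return irq;
}

/* Generic side: owns the descriptor and frees it if the hook fails. */
static int msi_capability_init(struct pci_dev *dev)
{
	struct msi_desc *entry = calloc(1, sizeof(*entry));
	int irq;

	if (!entry)
		return -ENOMEM;

	irq = arch_setup_msi_irq(dev, entry);
	if (irq < 0) {
		free(entry);
		return irq;
	}
	entry->head = irq;
	entry->tail = irq;
	dev->first_msi_irq = irq;
	dev->desc = entry;
	return 0;
}
```

The point of the split is that a failing hook leaves nothing behind: the irq is destroyed by the code that created it, and the descriptor is freed by the code that allocated it.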

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28  8:36         ` Eric W. Biederman
@ 2007-01-28 20:14           ` Benjamin Herrenschmidt
  2007-01-28 20:53             ` Eric W. Biederman
  2007-01-28 23:25             ` David Miller
  0 siblings, 2 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 20:14 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> Anyway for architecture hooks I have it down to just:
> /*
>  * The arch hook for setup up msi irqs
>  */
> int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
> void arch_teardown_msi_irq(unsigned int irq);

Which we would have to turn into "ops" hooks right away on powerpc
anyway because we can have multiple implementations in a given kernel
image depending on a mix of platform and which bus the device is on.
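
As a rough illustration of what turning the two hooks into "ops" could look like, here is a small self-contained sketch. Everything in it is hypothetical: the msi_ops table, the msi_ops pointer hung off the bus, the walk up through parent buses, and the mpic/rtas stub backends with their made-up return values. It only shows the idea of one kernel image choosing a backend per device based on the bus it sits on.

```c
#include <errno.h>
#include <stddef.h>

struct msi_desc;   /* opaque in this sketch */
struct pci_dev;

/* Hypothetical ops table mirroring the two arch hooks quoted above. */
struct msi_ops {
	int  (*setup_msi_irq)(struct pci_dev *dev, struct msi_desc *desc);
	void (*teardown_msi_irq)(unsigned int irq);
};

/* Stand-in bus and device types; the msi_ops member is an assumption. */
struct pci_bus {
	const struct msi_ops *msi_ops;
	struct pci_bus *parent;
};

struct pci_dev {
	struct pci_bus *bus;
};

/* Two stub backends; the irq numbers they return are arbitrary. */
static int mpic_setup(struct pci_dev *dev, struct msi_desc *desc)
{
	(void)dev; (void)desc;
	return 100;
}

static int rtas_setup(struct pci_dev *dev, struct msi_desc *desc)
{
	(void)dev; (void)desc;
	return 200;
}

static void noop_teardown(unsigned int irq)
{
	(void)irq;
}

static const struct msi_ops mpic_msi_ops = {
	.setup_msi_irq = mpic_setup,
	.teardown_msi_irq = noop_teardown,
};

static const struct msi_ops rtas_msi_ops = {
	.setup_msi_irq = rtas_setup,
	.teardown_msi_irq = noop_teardown,
};

/* Use the nearest bus, walking upwards, that has a backend attached. */
static const struct msi_ops *get_msi_ops(const struct pci_dev *dev)
{
	const struct pci_bus *bus;

	for (bus = dev->bus; bus; bus = bus->parent)
		if (bus->msi_ops)
			return bus->msi_ops;
	return NULL;
}

/* Dispatch through whatever backend the device's bus selects. */
static int setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
{
	const struct msi_ops *ops = get_msi_ops(dev);

	return ops ? ops->setup_msi_irq(dev, desc) : -ENOSYS;
}
```

Whether the lookup belongs in generic code or behind an arch function is exactly what the rest of this thread argues about; the sketch takes no side on that.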

> Which should be good enough to handle everything but RTAS.

You keep ignoring the problem then... we -HAVE- to handle the RTAS case.
In addition, it's not unlikely that other virtualized environments will
provide similar very high-level APIs to MSIs.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 19:40   ` Eric W. Biederman
@ 2007-01-28 20:23     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 20:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> The other big change is that I added a field to irq_desc to point
> at the msi_desc.  This removes the conflicts with the existing pointer
> fields and makes the irq -> msi_desc mapping useable outside of msi.c

I'm not even sure we would have needed that with Michael's mechanism in
fact. One other reason why I prefer it.

Basically, backends like MPIC etc... don't need it. The irq chip
operations are normal MPIC operations and don't need to know they are
done on an MSI nor what MSI etc... and thus we don't need it. Same with
RTAS.

On the other hand, x86 needs it, but then, x86 uses its own MSI
specific irq_chip, in which case it can use irq_desc->chip_data as long
as it does it within the backend.

So I may have missed a case where a given backend might need both that
irq -> msi_desc mapping -and- use irq_desc->chip_data for other things,
but that's one thing I was hoping we could avoid with Michael's code.

> The only architecture problem that isn't solvable in this context is
> the problem of supporting the crazy hypervisor on the ppc RTAS, which
> asks us to drive the hardware but does not give us access to the
> hardware registers.

So you are saying that we should use your model while admitting that it
can't solve our problems...

I really don't understand why you seem so totally opposed to Michael's
approach which definitely looks to me like the sane thing to do. Note
that in the end, Michael's approach isn't -that- different from yours,
just a bit more abstracted.

Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:23     ` Benjamin Herrenschmidt
@ 2007-01-28 20:47       ` Jeff Garzik
  -1 siblings, 0 replies; 178+ messages in thread
From: Jeff Garzik @ 2007-01-28 20:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt wrote:
>> The only architecture problem that isn't solvable in this context is
>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>> asks us to drive the hardware but does not give us access to the
>> hardware registers.
> 
> So you are saying that we should use your model while admitting that it
> can't solve our problems...
> 
> I really don't understand why you seem so totally opposed to Michael's
> approach which definitely looks to me like the sane thing to do. Note
> that in the end, Michael's approach isn't -that- different from yours,
> just a bit more abstracted.


I think the high-level ops approach makes more sense.  It's more future 
proof, in addition to covering all existing implementations.

	Jeff



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 20:14           ` Benjamin Herrenschmidt
@ 2007-01-28 20:53             ` Eric W. Biederman
  2007-01-28 21:17               ` Benjamin Herrenschmidt
  2007-01-28 23:26               ` David Miller
  2007-01-28 23:25             ` David Miller
  1 sibling, 2 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 20:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> Anyway for architecture hooks I have it down to just:
>> /*
>>  * The arch hook for setup up msi irqs
>>  */
>> int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
>> void arch_teardown_msi_irq(unsigned int irq);
>
> Which we would have to turn into "ops" hooks right away on powerpc
> anyway because we can have multiple implementations in a given kernel
> image depending on a mix of platform and which bus the device is on.

Yes of some form.  Although only needing 2 ops instead of 6 is still
simpler.  Until we can agree on a point where the ops lookup is
generic I don't see the point in placing it in generic code.

In addition I am extremely uncomfortable with making the interface to
the architecture any wider than we need it to be, as refactoring code
across multiple architectures is hard as usually the developer does
not have the hardware to touch all of the code that is touched.

>> Which should be good enough to handle everything but RTAS.
>
> You keep ignoring the problem then... we -HAVE- to handle the RTAS case.
> In addition, it's not unlikely that other virtualized environments will
> provide similar very high-level APIs to MSIs.

No I'm postponing the problem in good unix fashion and delivering the
90% solution now.  Beyond that I'm taking the problem in small
comprehensible steps.  I'm not saying we have to stop there but
we need to pass through this point.

The argument that we need to support what the RTAS is doing to support
other hypervisors seems to be a fallacy.  What the RTAS is doing is
not sane from a hardware standpoint, so I do not expect it from other
virtualized/hypervisor style environments. 

If the hardware provides capabilities to isolate the MSI messages
properly it does not need to prevent us from touching the msi setup
registers.  If the hardware does not isolate the MSI messages properly
there is another problem.  Especially in the context of MSI-X where
the registers can be in the middle of any mmio bar I do not see a sane
way of keeping us from touching the hardware directly in the first
place.

However, it is quite likely that supporting the RTAS is not going to
require much code.  So I don't see an argument against
supporting the RTAS.


There is the additional problem in all of this that our interface for
MSI-X to the drivers is quite likely the wrong interface.  I believe
we will want to incrementally allocate more irqs at run time as there
are work queues or the like which can be attached to them.  We can get
there with the current vector allocator by freeing and reallocating
all of the msi-x irqs when the driver wants more so the current
interface will suffice but it is far from optimal.

Also I'm not at all comfortable with the 32k msix_entry array
allocation we will need for an MSI-X device that pushes the limits
of the number of irqs it can allocate, especially as this goes
up to 64k when we start using the proper types to hold the linux
irq number.

Small simple obviously correct steps.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 20:53             ` Eric W. Biederman
@ 2007-01-28 21:17               ` Benjamin Herrenschmidt
  2007-01-28 22:36                 ` Eric W. Biederman
  2007-01-28 23:26               ` David Miller
  1 sibling, 1 reply; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 21:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> Yes of some form.  Although only needing 2 ops instead of 6 is still
> simpler.  Until we can agree on a point where the ops lookup is
> generic I don't see the point in placing it in generic code.

At least 4 actually, since we would need suspend/resume as well.
 
> In addition I am extremely uncomfortable with making the interface to
> the architecture any wider than we need it to be, as refactoring code
> across multiple architectures is hard as usually the developer does
> not have the hardware to touch all of the code that is touched.

The interface to the arch in our model is the function to get ops :-)
Most "normal" backends would just "plug" those ops with the provided
raw_ functions.

> The argument that we need to support what the RTAS is doing to support
> other hypervisors seems to be a fallacy.  What the RTAS is doing is
> not sane from a hardware standpoint, so I do not expect it from other
> virtualized/hypervisor style environments. 
>
> If the hardware provides capabilities to isolate the MSI messages
> properly it does not need to prevent us from touching the msi setup
> registers. 

It does isolate and it doesn't -prevent- config space access. However,
in order to enable MSIs, we have to configure the device -and- the IRQ
controller on the bus on which the device sits, that is, to obtain
vectors from the HV, configure the controller to receive MSIs from that
device and route them to us, etc...., and the only API the HV provides
for doing so is that RTAS function that configures both in one call.

I don't see what's fundamentally wrong with that approach.

>  If the hardware does not isolate the MSI messages properly
> there is another problem.  Especially in the context of MSI-X where
> the registers can be in the middle of any mmio bar I do not see a sane
> way of keeping us from touching the hardware directly in the first
> place.

They are not blocked as I said above, at least not for most devices,
(though the controller/receiver side is). However, we don't have an API
to get the address/value to write into the device, nor to
configure/enable MSIs in the PIC. The only API we have is basically
called "change-msi" which can be used to enable MSI, MSI-X or disable
them (though we can provide how many we want to enable out of what is
requested by the device... we can't enable sparse MSI-X though, we can
only enable the N first ones).
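
To illustrate the "no sparse MSI-X" restriction, here is a minimal sketch of the kind of test a backend with that limitation might apply before accepting a request. The function name is made up, and the struct is a stand-in that mirrors the two 16-bit fields of the kernel's struct msix_entry as I understand it. Insisting on ascending order is a simplifying assumption; a real backend might accept any ordering of entries 0..N-1.

```c
#include <errno.h>

/* Stand-in for the kernel's struct msix_entry. */
struct msix_entry {
	unsigned short vector;   /* filled in by the kernel on success */
	unsigned short entry;    /* MSI-X table index the driver asks for */
};

/*
 * Hypothetical check for a backend that can only enable the first N
 * MSI-X entries: accept entries 0..num-1 in order, reject anything else.
 */
static int check_dense_msix(const struct msix_entry *entries, int num)
{
	int i;

	if (num <= 0)
		return -EINVAL;

	for (i = 0; i < num; i++)
		if (entries[i].entry != i)
			return -EINVAL;

	return 0;
}
```

A driver asking for table entries {0, 1, 2} would pass; one asking for {0, 5} would be refused up front rather than failing part way through the firmware call.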

> However, it is quite likely that supporting the RTAS is not going to
> require much code.  So I don't see an argument against
> supporting the RTAS.

It would imply 2 or 3 more hooks at the toplevel... so we are going from
your 2 initial hooks to 4 (because we need to hook suspend/resume), now to 6
or 7.... 

> There is the additional problem in all of this that our interface for
> MSI-X to the drivers is quite likely the wrong interface.  I believe
> we will want to incrementally allocate more irqs at run time as there
> are work queues or the like which can be attached to them.  We can get
> there with the current vector allocator by freeing and reallocating
> all of the msi-x irqs when the driver wants more so the current
> interface will suffice but it is far from optimal.

Our hypervisor will not unfortunately let us do that. We can only use
RTAS "change-msi" to allocate more/less MSIs and that is disruptive of
the device function (we might lose pending interrupts when doing so, in
fact, in the initial HV interface definition, we could only do that with
the device actually disabled in the command register !).

In general, I'd rather have the device pre-allocate the MSI-X it needs,
though it can later on decide to use more or less.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47       ` Jeff Garzik
@ 2007-01-28 21:20         ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 21:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Benjamin Herrenschmidt, Greg Kroah-Hartman, Tony Luck,
	Grant Grundler, Ingo Molnar, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller

Jeff Garzik <jeff@garzik.org> writes:

> Benjamin Herrenschmidt wrote:
>>> The only architecture problem that isn't solvable in this context is
>>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>>> asks us to drive the hardware but does not give us access to the
>>> hardware registers.
>>
>> So you are saying that we should use your model while admitting that it
>> can't solve our problems...
>>
>> I really don't understand why you seem so totally opposed to Michael's
>> approach which definitely looks to me like the sane thing to do. Note
>> that in the end, Michael's approach isn't -that- different from yours,
>> just a bit more abstracted.
>
>
> I think the high-level ops approach makes more sense.  It's more future proof,
> in addition to covering all existing implementations.

I'm not arguing against an operations based approach.  I'm arguing for simple
obviously correct steps, and not throwing the baby out with the bath
water.

My patches should be a precursor to an operations based approach
because they are a simple step from where we are now.

Everyone keeps telling me the operations approach is the right thing to
do and I see code that doesn't work, and can't work without extreme
difficulty on the architectures currently supported.  That makes me
irritated, and unfortunately much less accepting.

I see people pushing ridiculous interfaces like the RTAS hypervisor
interface at me, and saying we must support running firmware drivers
in the msi code.

I just ask for simple evolutionary change as I presented, so we don't
break things or lose requirements along the way.

Please argue with me on the details of what the ops based approach does
better, and which specific problems it solves.

The proposed ops based approach mixes different kinds of operations
in the same structure:

We have the hardware operations:
+	/* enable - Enable the MSIs on the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine enables the MSIs on the given PCI device.
+	 *
+	 * If the enable completes successfully this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*enable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* disable - disable the MSI for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs to disable.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine should perform the inverse of enable.
+	 */
+	void (*disable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+

Which are either talking directly to the hardware, or are talking
to the hypervisor, which is using hardware isolation so it is safe to
talk directly to the hardware but isn't letting us?  If we could use
things to work around errata in card implementation details it would
make some sense to me (although we don't seem to have any cards
that got the MSI registers wrong at this point).  Regardless, these
operations clearly have a different granularity than the other
operations, and should have a different lookup method.


We have the irq operations.
+	/* check - Check that the requested MSI allocation is OK.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for checking that the given PCI device
+	 * can be allocated the requested type and number of MSIs.
+	 *
+	 * It is up to this routine to determine if the requested number of
+	 * MSIs is valid for the device in question. If the number of MSIs,
+	 * or the particular MSI entries, can not be supported for any
+	 * reason this routine must return non-zero.
+	 *
+	 * If the check is successful this routine must return 0.
+	 */
+	int (*check) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* alloc - Allocate MSIs for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for allocating the number of
+	 * MSIs to the given PCI device.
+	 *
+	 * Upon completion there must be @num MSIs assigned to this device,
+	 * the "vector" member of each struct msix_entry must be filled in
+	 * with the Linux irq number allocated to it. The corresponding
+	 * irq_descs must also be setup with an appropriate handler if
+	 * required.
+	 *
+	 * If the allocation completes successfully this routine must return 0.
+	 */
+	int (*alloc) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* free - free the MSIs assigned to the device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Free all MSIs and associated resources for the device. If any
+	 * MSIs have been enabled they will have been disabled already by
+	 * the generic code.
+	 */
+	void (*free) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);

These, because they are per irq, make sense as per bus operations unless
you have a good architecture definition like x86 has.  Roughly those
operations are what we currently have except the current operations
are a little simpler and easier to deal with for the architecture
code.

And then there are the operations that are going in the wrong
direction.
+	/* setup_msi_msg - Setup an MSI message for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @entry:	The MSI entry to create a msi_msg for.
+	 * @msg:	Written with the magic address and data.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Returns the "magic address and data" used to trigger the msi.
+	 * If the setup is successful this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
+				struct msi_msg *msg, int type);

Much too much of the operations based approach as proposed looks like
"when you have a hammer every problem looks like a nail", given how much
confusion there is about what was put into the operations structure.
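
For readers trying to picture how the callbacks quoted above might fit together, here is a small self-contained sketch. The callback signatures are copied from the quoted kernel-doc, but the call order (check, then alloc, then the optional enable, with free on failure) is only my assumption about how generic code could drive them, and the demo backend and its vector numbers are invented for the example.

```c
#include <stddef.h>

struct pci_dev;   /* opaque in this sketch */

/* Stand-in for the kernel's struct msix_entry. */
struct msix_entry {
	unsigned short vector;
	unsigned short entry;
};

/* Subset of the quoted callbacks; enable is documented as optional. */
struct msi_ops {
	int  (*check)(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type);
	int  (*alloc)(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type);
	void (*free)(struct pci_dev *pdev, int num,
		     struct msix_entry *entries, int type);
	int  (*enable)(struct pci_dev *pdev, int num,
		       struct msix_entry *entries, int type);
};

/* Assumed generic driver of the ops: check, alloc, optional enable. */
static int generic_msi_enable(const struct msi_ops *ops, struct pci_dev *pdev,
			      int num, struct msix_entry *entries, int type)
{
	int rc;

	rc = ops->check(pdev, num, entries, type);
	if (rc)
		return rc;

	rc = ops->alloc(pdev, num, entries, type);
	if (rc)
		return rc;

	if (ops->enable) {
		rc = ops->enable(pdev, num, entries, type);
		if (rc)
			ops->free(pdev, num, entries, type);
	}
	return rc;
}

/* Invented demo backend: at most 4 vectors, numbered from 32. */
static int demo_check(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type)
{
	(void)pdev; (void)entries; (void)type;
	return (num >= 1 && num <= 4) ? 0 : -1;
}

static int demo_alloc(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type)
{
	int i;

	(void)pdev; (void)type;
	for (i = 0; i < num; i++)
		entries[i].vector = (unsigned short)(32 + i);
	return 0;
}

static void demo_free(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type)
{
	int i;

	(void)pdev; (void)type;
	for (i = 0; i < num; i++)
		entries[i].vector = 0;
}

static int demo_enable_fail(struct pci_dev *pdev, int num,
			    struct msix_entry *entries, int type)
{
	(void)pdev; (void)num; (void)entries; (void)type;
	return -1;
}

/* Backend that relies on the generic code for enabling. */
static const struct msi_ops demo_ops = {
	.check = demo_check,
	.alloc = demo_alloc,
	.free = demo_free,
	.enable = NULL,
};

/* Backend whose enable always fails, to show the unwind path. */
static const struct msi_ops failing_ops = {
	.check = demo_check,
	.alloc = demo_alloc,
	.free = demo_free,
	.enable = demo_enable_fail,
};
```

Seen this way, check/alloc/free deal in irq numbers while enable deals with the device, which is the difference in granularity being objected to here.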

I don't mind taking a small step and making the alloc/free primitives
per bus in a generic fashion.

I don't mind supporting poorly designed hypervisor interfaces, if it
is easy.

I do strongly mind code that doesn't work, or we can't git-bisect
through to find where bugs were introduced.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
@ 2007-01-28 21:20         ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 21:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Tony Luck, Grant Grundler, David S. Miller, linux-kernel,
	Kyle McMartin, linuxppc-dev, Brice Goglin, Greg Kroah-Hartman,
	shaohua.li, Ingo Molnar, linux-pci

Jeff Garzik <jeff@garzik.org> writes:

> Benjamin Herrenschmidt wrote:
>>> The only architecture problem that isn't solvable in this context is
>>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>>> asks us to drive the hardware but does not give us access to the
>>> hardware registers.
>>
>> So you are saying that we should use your model while admitting that it
>> can't solve our problems...
>>
>> I really don't understand why you seem so totally opposed to Michael's
>> approach which definitely looks to me like the sane thing to do. Note
>> that in the end, Michael's approach isn't -that- different from yours,
>> just a bit more abstracted.
>
>
> I think the high-level ops approach makes more sense.  It's more future proof,
> in addition to covering all existing implementations.

I'm not arguing against an operations based approach.  I'm arguing for simple
obviously correct steps, and not throwing the baby out with the bath
water.

My patches should be a precursor to an operations based approach
because they are a simple step from where we are now.

Everyone keeps telling me the operations approach is the right thing to
do and I see code that doesn't work, and can't work without extreme
difficulty on the architectures currently supported.  That makes me
irritated, and unfortunately much less accepting.

I see people pushing ridiculous interfaces like the RTAS hypervisor
interface at me, and saying we must support running firmware drivers
in the msi code.

I just ask for simple evolutionary change as I presented, so we don't
break things or lose requirements along the way.

Please argue with me on the details of what the ops based approach does
better, and which specific problems it solves.

The proposed ops based approach mixes different kinds of operations
in the same structure:

We have the hardware operations:
+	/* enable - Enable the MSIs on the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine enables the MSIs on the given PCI device.
+	 *
+	 * If the enable completes succesfully this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*enable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* disable - disable the MSI for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs to disable.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+         * This routine should perform the inverse of enable.
+	 */
+	void (*disable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+

Which are either talking directly to the hardware, or are talking
to the hypervisor, which is using hardware isolation so it is safe to
talk directly to the hardware but isn't letting us?  If we could use
things to work around errata in card implementation details it would
make some sense to me (although we don't seem to have any cards
that got the MSI registers wrong at this point).  Regardless, these
operations clearly have a different granularity than the other
operations, and should have a different lookup method.


We have the irq operations.
+	/* check - Check that the requested MSI allocation is OK.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for checking that the given PCI device
+	 * can be allocated the requested type and number of MSIs.
+	 *
+	 * It is up to this routine to determine if the requested number of
+	 * MSIs is valid for the device in question. If the number of MSIs,
+	 * or the particular MSI entries, can not be supported for any
+	 * reason this routine must return non-zero.
+	 *
+	 * If the check is succesful this routine must return 0.
+	 */
+	int (*check) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* alloc - Allocate MSIs for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for allocating the number of
+	 * MSIs to the given PCI device.
+	 *
+	 * Upon completion there must be @num MSIs assigned to this device,
+	 * the "vector" member of each struct msix_entry must be filled in
+	 * with the Linux irq number allocated to it. The corresponding
+	 * irq_descs must also be setup with an appropriate handler if
+	 * required.
+	 *
+	 * If the allocation completes succesfully this routine must return 0.
+	 */
+	int (*alloc) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* free - free the MSIs assigned to the device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Free all MSIs and associated resources for the device. If any
+	 * MSIs have been enabled they will have been disabled already by
+	 * the generic code.
+	 */
+	void (*free) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);

These, because they are per irq, make sense as per bus operations unless
you have a good architecture definition like x86 has.  Roughly those
operations are what we currently have, except the current operations
are a little simpler and easier to deal with for the architecture
code.

And then there are the operations that are going in the wrong
direction.
+	/* setup_msi_msg - Setup an MSI message for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @entry:	The MSI entry to create a msi_msg for.
+	 * @msg:	Written with the magic address and data.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Returns the "magic address and data" used to trigger the msi.
+	 * If the setup is succesful this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
+				struct msi_msg *msg, int type);

Much too much of the operations based approach as proposed looks like
"when you have a hammer every problem looks like a nail", given how much
confusion there is about what was put into the operations structure.

I don't mind taking a small step and making the alloc/free primitives
per bus in a generic fashion.

I don't mind supporting poorly designed hypervisor interfaces, if it
is easy.

I do strongly mind code that doesn't work, or we can't git-bisect
through to find where bugs were introduced.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20         ` Eric W. Biederman
@ 2007-01-28 21:26           ` Ingo Molnar
  -1 siblings, 0 replies; 178+ messages in thread
From: Ingo Molnar @ 2007-01-28 21:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Benjamin Herrenschmidt, Greg Kroah-Hartman,
	Tony Luck, Grant Grundler, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> I'm not arguing against an operations based approach.  I'm arguing for 
> simple obviously correct steps, and not throwing the baby out with the 
> bath water.
> 
> My patches should be a precursor to an operations based approach
> because they are simple step from where we are now.

yeah. I'd say your approach is to go from A to B:

  [A] -----------------------------------------------------> [B]
                                                              |
                                                             [C]

while there might be some other arguments that "no, let's go to C
instead", i say let's not throw away the already implemented and already
working and nicely layered [A]->[B] transition, just because there's an
argument whether the end result should be 'B' or 'C'. Unless someone who
wants to see 'C' produces a patchset that walks the whole way i don't see
any reason to not go with your patchset. It clearly removes a lot of
cruft.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:23     ` Benjamin Herrenschmidt
@ 2007-01-28 21:34       ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 21:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> The other big change is that I added a field to irq_desc to point
>> at the msi_desc.  This removes the conflicts with the existing pointer
>> fields and makes the irq -> msi_desc mapping useable outside of msi.c
>
> I'm not even sure we would have needed that with Michael's mecanism in
> fact. One other reason why I prefer it.
>
> Basically, backends like MPIC etc... don't need it. The irq chip
> operations are normal MPIC operations and don't need to know they are
> done on an MSI nor what MSI etc... and thus we don't need it. Same with
> RTAS.

If you get rid of the bass ackwards setup_msi_msg operation they do,
so you can support at least one write_msi_msg call.

> On the other hand, x86 needs it, but then, x86 uses it's own MSI
> specific irq_chip, in which case it can use irq_desc->chip_data as long
> as it does it within the backend.

Most of the uses are within msi.c as the code is currently structured
which means you can't use it that way.
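The trade-off being argued here can be shown with a small sketch. It assumes a dedicated pointer in the per-irq descriptor, so that generic code can find the MSI descriptor without borrowing chip_data from the backend. The structure layout, NR_IRQS value and both accessor names are hypothetical and are not taken from either patchset.

```c
#include <stddef.h>

#define NR_IRQS 16			/* hypothetical table size */

struct msi_desc { int attrib; };	/* hypothetical, trimmed down */

struct irq_desc {
	void *chip_data;		/* stays owned by the irq_chip backend */
	struct msi_desc *msi_desc;	/* dedicated slot for generic MSI code */
};

struct irq_desc irq_desc[NR_IRQS];

/* Record the irq -> msi_desc mapping; returns 0 on success. */
int demo_set_irq_msi(unsigned int irq, struct msi_desc *entry)
{
	if (irq >= NR_IRQS)
		return -1;
	irq_desc[irq].msi_desc = entry;
	return 0;
}

/* Look the mapping up again; usable from outside the backend. */
struct msi_desc *demo_get_irq_msi(unsigned int irq)
{
	return irq < NR_IRQS ? irq_desc[irq].msi_desc : NULL;
}
```

With a separate slot, a backend that already uses chip_data for its own bookkeeping keeps that pointer untouched while the generic code still gets its mapping.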

> So I may have missed a case where a given backend might need both that
> irq -> msi_desc mapping -and- use irq_desc->chip_data for other things,
> but that's one thing I was hoping we could avoid with Michael's code.

That is where we are today.  Find a way to remove the code that uses it
and it can go away.

>> The only architecture problem that isn't solvable in this context is
>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>> asks us to drive the hardware but does not give us access to the
>> hardware registers.
>
> So you are saying that we should use your model while admitting that it
> can't solve our problems...

My approach can solve your problems with a few tweaks, just like Michael's
approach would have needed to solve mine.

> I really don't understand why you seem so totally opposed to Michael's
> approach which definitely looks to me like the sane thing to do. Note
> that in the end, Michael's approach isn't -that- different from yours,
> just a bit more abstracted.

1) Because everyone tells me it is the greatest thing since sliced bread,
   and when I look it simply doesn't work, and my feeling is that it would
   be a complete retesting effort of all currently supported architectures
   to make Michael's code work.

2) Because it was scrap and replace, which is a horrible way to deal with
   a problem when we have 3 architectures working already.

Honestly I think Michael and I can get along but all of the cheerleaders seem
to be exacerbating the situation.

I do agree Michael's approach isn't that different than mine and I think we
can converge on a single implementation.  To a large extent that is what
my patchset is about: moving the current code far enough that it is usable,
and a reasonable basis for more work.

I didn't write the current code but since I touched it and started cleaning
it up I seem to be stuck with it.  So I will be happy to take care of it
until we get a version that all architectures can use.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 19:42     ` Eric W. Biederman
@ 2007-01-28 22:01       ` Paul Mackerras
  -1 siblings, 0 replies; 178+ messages in thread
From: Paul Mackerras @ 2007-01-28 22:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Eric W. Biederman writes:

> @@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
>  	if (!pos)
>  		return -EINVAL;
>  
> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
> +	WARN_ON(!!dev->msi_enabled);

Minor nit: what's wrong with just WARN_ON(dev->msi_enabled) ?
Also here:

> @@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
>  				return -EINVAL;	/* duplicate entry */
>  		}
>  	}
> -	temp = dev->irq;
> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
> +	WARN_ON(!!dev->msix_enabled);

Paul.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20         ` Eric W. Biederman
@ 2007-01-28 22:09           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 22:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller


 .../... (enable/disable bits)

> Which are either talking directly to the hardware, or are talking
> to the hypervisor, which is using hardware isolation so it is safe to
> talk directly to the hardware but isn't leting us?  If we could use
> things to work around errata in card implementation details it would
> make some sense to me (although we don't seem to have any cards with
> that got the MSI registers wrong at this point).  Regardless these
> operations clearly have a different granularity than the other
> operations, and should have a different lookup method.

I'm not sure I understand the point of your rant here. The hypervisor
case hooks at alloc/free and does everything from there. It doesn't use
an enable or a disable hook.

The enable/disable ops are optional for that reason. When not present,
it's assumed that alloc/free do it all.
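A minimal sketch of that optional-hook convention, assuming trimmed-down stand-ins for the patch's structures. The names msi_generic_enable, hv_alloc and hv_ops, and the fields on the fake pci_dev, are hypothetical and only illustrate the control flow described above.

```c
#include <stddef.h>

/* Hypothetical, trimmed-down stand-ins for the structures in the patch. */
struct pci_dev { int allocated; int enabled; };
struct msix_entry { int vector; };

struct msi_ops {
	int (*alloc)(struct pci_dev *pdev, int num,
		     struct msix_entry *entries, int type);
	/* Optional: NULL means alloc has already done everything. */
	int (*enable)(struct pci_dev *pdev, int num,
		      struct msix_entry *entries, int type);
};

/* Generic top level: alloc always runs, enable only if the backend set it. */
int msi_generic_enable(const struct msi_ops *ops, struct pci_dev *pdev,
		       int num, struct msix_entry *entries, int type)
{
	int rc = ops->alloc(pdev, num, entries, type);

	if (rc)
		return rc;
	if (ops->enable)
		rc = ops->enable(pdev, num, entries, type);
	return rc;
}

/* Hypervisor-style backend: one firmware call allocates and enables. */
int hv_alloc(struct pci_dev *pdev, int num,
	     struct msix_entry *entries, int type)
{
	(void)type;
	for (int i = 0; i < num; i++)
		entries[i].vector = 100 + i;	/* made-up irq numbers */
	pdev->allocated = num;
	pdev->enabled = 1;
	return 0;
}

const struct msi_ops hv_ops = { .alloc = hv_alloc, .enable = NULL };
```

A "raw" backend would fill in .enable with the shared helper instead of leaving it NULL; the top-level function stays the same in both cases.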

When using a "direct" approach (what we call "raw"), we expect backends
to just plug the provided helper functions in enable/disable. It's still
a hook so that one can do additional platform specific bits if
necessary, but in that specific case, I do agree we could just remove it
and move the "raw" code back into the toplevel functions, with a way
(via a special return code from alloc maybe ?) for the HV case to tell
us not to go through there. That was one of our initial approaches when
working with Michael on the design.

However, that sort of hurts my sense of aesthetics :-) I quite like the
toplevel to be just a toplevel, and clearly separate the raw "helpers"
and the backend. Provides more flexibility to handle all possible crazy
cases in the future.

You seem to absolutely want to get the HV case to go through the same
code path as the "raw" case, and that will not happen.

  .../... (irq operations)

> These because they are per irq make sense as per bus operations unless
> you have a good architecture definition like x86 has.  Roughly those
> operations are what we currently have except the current operations
> are a little simpler and easier to deal with for the architecture
> code.

Oh ? How so ? (easier/simpler ?)

> And then there are the operations that are going in the wrong
> direction.
> +	/* setup_msi_msg - Setup an MSI message for the given device.
> +	 *
> +	 * @pdev:	PCI device structure.
> +	 * @entry:	The MSI entry to create a msi_msg for.
> +	 * @msg:	Written with the magic address and data.
> +	 * @type:	The type, MSI or MSI-X.
> +	 *
> +	 * Returns the "magic address and data" used to trigger the msi.
> +	 * If the setup is succesful this routine must return 0.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
> +				struct msi_msg *msg, int type);
> 
> Much to much of the operations base approach as proposed looks like
> when you have a hammer every problem looks like a nail, given how much
> confusion about what was put into the operations structure.

This is indeed a lower level hook to be used by "raw" enable/disable.
Another approach would be to remove it, have each backend have its own
enable/disable that obtains the address/data and calls into a helper to
program them. This would indeed be a little bit nicer from a layering
perspective. But it adds a bit more code to each backend, so we kept
things closer to the way they used to be. I don't have a firm reason not
to change it however, I need to talk to Michael in case he has more good
reasons to keep it that way around.
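The alternative layering described above can be sketched roughly as follows: the backend's own enable computes the address/data pair and pushes it through a shared helper, instead of the generic code pulling it out via setup_msi_msg. Everything here is a hypothetical illustration; msi_program_msg, example_backend_enable, the fake pci_dev and the address value are made up, and a real helper would write the device's MSI capability registers.

```c
#include <stdint.h>

struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical device: the "programmed" field stands in for config space. */
struct pci_dev { struct msi_msg programmed; };

/* Shared helper: the only code that knows how to program the device. */
void msi_program_msg(struct pci_dev *pdev, const struct msi_msg *msg)
{
	pdev->programmed = *msg;
}

/* Backend enable: obtains address/data itself, then calls the helper. */
int example_backend_enable(struct pci_dev *pdev, unsigned int hwirq)
{
	struct msi_msg msg = {
		.address_lo = 0xfee00000u,	/* made-up doorbell address */
		.address_hi = 0,
		.data = hwirq,
	};

	msi_program_msg(pdev, &msg);
	return 0;
}
```

The cost mentioned in the reply is visible here: each backend carries its own small enable function, in exchange for not needing a setup_msi_msg callback in the ops structure.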

> I don't mind taking a small step and making the alloc/free primitives
> per bus in a generic fashion. 
>
> I don't mind supporting poorly designed hypervisor interfaces, if it
> is easy.

And if it's not, we don't support them ? Ugh ? Well, it happens to be
fairly easy but still, I don't understand your approach there.

> I do strongly mind code that doesn't work, or we can't git-bisect
> through to find where bugs were introduced.

It doesn't work yet for you which is why it's not -replacing- your
current code. Again, this was intended as arch code in the first place,
until other archs and maintainers voiced their opinion that we should
move that to generic code. It may not be perfect, we may still want to
change things, maybe make some things closer to the direction you are
taking for the x86 code, but I don't understand the root of such a
strong opposition except maybe that you've spent time trying to fix the
x86 junk and now are annoyed to see some of that work possibly
replaced ?

I agree with the problem of small changes & bisecting in the general
case. In fact, it would be nice if we could use your fixed code with
little change to "plug" it in as the x86 backend in many ways. Michael's
work isn't a re-implementation of everything, it's a re-structuring,
lots of bits of code that are missing can possibly be lifted from the
existing working implementation.

If we followed that "only do incremental changes" rule all the time,
imagine what state our USB stack would be in today since we couldn't
have dropped in Linus' replacement one ...

Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47       ` Jeff Garzik
@ 2007-01-28 22:11         ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 22:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Benjamin Herrenschmidt, Greg Kroah-Hartman, Tony Luck,
	Grant Grundler, Ingo Molnar, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller

Jeff Garzik <jeff@garzik.org> writes:

> I think the high-level ops approach makes more sense.  It's more future proof,
> in addition to covering all existing implementations.

To be precise, in Michael's implementation one of the parameters passed is
a type parameter, so the architecture has to know about each different
type of msi implementation.   In my implementation that field does not exist,
because it is unnecessary.  So as long as the message on the bus is a msi
message my implementation can be adapted to support it without any architecture
changes.
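A rough sketch of that argument, assuming a split where only the generic code knows whether a descriptor is MSI or MSI-X and the architecture hook just composes a message for an irq. All names (arch_compose_msg, generic_write_msg, generic_setup_irq, the msi_desc fields) and the address value are hypothetical and do not reproduce either patchset.

```c
#include <stdint.h>

enum msi_kind { KIND_MSI, KIND_MSIX };	/* known only to generic code */

struct msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

struct msi_desc {
	enum msi_kind kind;
	struct msi_msg msg;	/* last message written */
	int wrote_table;	/* 1 if the MSI-X table was the target */
};

/* Architecture side: takes no type argument at all. */
void arch_compose_msg(unsigned int irq, struct msi_msg *msg)
{
	msg->address_lo = 0xfee00000u;	/* made-up target address */
	msg->address_hi = 0;
	msg->data = irq;
}

/* Generic side: the only place that distinguishes MSI from MSI-X. */
void generic_write_msg(struct msi_desc *desc, const struct msi_msg *msg)
{
	desc->msg = *msg;
	desc->wrote_table = (desc->kind == KIND_MSIX);
}

int generic_setup_irq(struct msi_desc *desc, unsigned int irq)
{
	struct msi_msg msg;

	arch_compose_msg(irq, &msg);
	generic_write_msg(desc, &msg);
	return 0;
}
```

Under these assumptions, adding another message-signalled variant would touch enum msi_kind and generic_write_msg, while arch_compose_msg stays unchanged.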

Being future proof is about getting the abstraction correct, and exposing
those details that matter, and removing those details that don't.

It is a minor nit, not a fundamental flaw in the operations concept.  But
it is one of the reasons I am opposed to throwing out the current working
code.  Evolutionary change ensures that things only the code remembers
don't get left behind.

I guess the other part of the discussion that shows up here is this:
as long as the change is an evolutionary change from what is
working today, I don't have any fundamental problems with it, but I
am completely against a revolutionary change.

Meanwhile because Michael has proposed operations my position has been
perceived as against operations.  While I have a lot of technical nits
to pick with Michael's operations approach, I'm not fundamentally
against it.  I just don't want to lose the information that only
the code remembers.

Most of my technical objections have been formed by looking at what
the code does today, looking at what Michael's code is doing and seeing
details he missed.  If we just start with the current code base and
fix it the whole approach is much easier.

Anyway last I heard Michael was working on starting with the current
msi.c and making his patch set work, and I am hoping that my work
will make that patchset cleaner, and easier to do.  Even if we do
conflict at the moment :)

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 22:01       ` Paul Mackerras
@ 2007-01-28 22:18         ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 22:18 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Paul Mackerras <paulus@samba.org> writes:

> Eric W. Biederman writes:
>
>> @@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
>>  	if (!pos)
>>  		return -EINVAL;
>>  
>> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
>> +	WARN_ON(!!dev->msi_enabled);
>
> Minor nit: what's wrong with just WARN_ON(dev->msi_enabled) ?

It's a bitfield, so gcc complains when something in WARN_ON applies
typeof to it.  So it is easier to just say !! than to dig into
WARN_ON and see whether it makes sense to fix WARN_ON, or whether gcc
needs the bug fix.
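
A minimal sketch of the problem, assuming a WARN_ON-style macro that
captures its argument with typeof.  FAKE_WARN_ON and struct fake_pci_dev
are made up for illustration; they are not the kernel's definitions.

```c
#include <stdio.h>

/* Hypothetical stand-in for the two flags kept in struct pci_dev. */
struct fake_pci_dev {
	unsigned int msi_enabled:1;
	unsigned int msix_enabled:1;
};

/*
 * Hypothetical WARN_ON-like macro.  It copies its argument into a
 * temporary declared with __typeof__, using the GCC statement
 * expression extension, and evaluates to 0 or 1.
 */
#define FAKE_WARN_ON(cond) ({					\
	__typeof__(cond) __ret = (cond);			\
	if (__ret)						\
		fprintf(stderr, "warning: %s\n", #cond);	\
	!!__ret;						\
})

int check_msi(const struct fake_pci_dev *dev)
{
	/*
	 * FAKE_WARN_ON(dev->msi_enabled) does not compile here, because
	 * typeof may not be applied to a bit-field.  The double negation
	 * turns the bit-field into a plain int expression first.
	 */
	return FAKE_WARN_ON(!!dev->msi_enabled);
}
```

With that macro, check_msi() returns 1 (and prints a warning) when the
flag is set and 0 otherwise.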

> Also here:
>
>> @@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct
> msix_entry *entries, int nvec)
>>  				return -EINVAL;	/* duplicate entry */
>>  		}
>>  	}
>> -	temp = dev->irq;
>> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
>> +	WARN_ON(!!dev->msix_enabled);
>
> Paul.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 21:17               ` Benjamin Herrenschmidt
@ 2007-01-28 22:36                 ` Eric W. Biederman
  2007-01-28 23:17                   ` Benjamin Herrenschmidt
  2007-01-28 23:31                   ` David Miller
  0 siblings, 2 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 22:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> Yes of some form.  Although only needing 2 ops instead of 6 is still
>> simpler.  Until we can agree on a point where the ops lookup is
>> generic I don't see the point in placing it in generic code.
>
> At least 4 actually, since we would need suspend/resume as well.

That one I don't quite understand yet.  But the current code is
quite happy to support suspend/resume generically.  It will be
interesting to see what part of that support does not work on ppc.

Especially as the current interface allows you to reprogram the
msi_message at any time.
  
>> In addition I am extremely uncomfortable with making the interface to
>> the architecture any wider than we need it to be, as refactoring code
>> across multiple architectures is hard as usually the developer does
>> not have the hardware to touch all of the code that is touched.
>
> The interface to the arch in our model is the function to get ops :-)
> Most "normal" backends would just "plug" those ops with the provided
> raw_ functions.
>
>> The argument that we need to support what the RTAS is doing to support
>> other hypervisors seems to be a fallacy.  What the RTAS is doing is
>> not sane from a hardware standpoint, so I do not expect it from other
>> virtualized/hypervisor style environments. 
>>
>> If the hardware provides capabilities to isolate the MSI messages
>> properly it does not need to prevent us from touching the msi setup
>> registers. 
>
> It does isolate and it doesn't -prevent- config space access. However,
> in order to enable MSIs, we have to configure the device -and- the IRQ
> controller on the bus on which the device sits on, that is, to obtain
> vectors from the HV, configure the controller to receive MSIs from that
> device and route them to us, etc...., and the only API the HV provides
> for doing so is that RTAS function that configures both in one call.
>
> I don't see what's fundamentally wrong with that approach.

Because it mixes concerns that do not need to be mixed, and it complicates
the code.  The hypervisor has no need to understand how a hardware
device is built, and how its registers operate.  It just needs to
know that the given hardware device will generate an msi message on
the bus.

The practical difference is if someone comes out with MSI-Y or they
have a card that doesn't quite implement the MSI registers correctly
that hypervisor interface falls down, whereas my current architecture
hooks do not.

Plus it is more code in the hypervisor if it has to distinguish
between MSI and MSI-X.

>>  If the hardware does not isolate the MSI messages properly
>> there is another problem.  Especially in the context of MSI-X where
>> the registers can be in the middle of any mmio bar I do not see a sane
>> way of keeping us from touching the hardware directly in the first
>> place.
>
> They are not blocked as I said above, at least not for most devices,
> (though the controller/receiver side is). However, we don't have an API
> to get the address/value to write into the device, nor to
> configure/enable MSIs in the PIC. The only API we have is basically
> called "change-msi" which can be used to enable MSI, MSI-X or disable
> them (though we can provide how many we want to enable out of what is
> requested by the device... we can't enable sparse MSI-X though, we can
> only enable the N first ones).

Right, so some way needs to be found to cope with that situation.
Likely that involves bypassing all of the code that talks directly to
the hardware for MSI.

>> However it is quite likely that supporting the RTAS is not going to 
>> require much code to support.  So I don't see an argument against not
>> supporting the RTAS.
>
> It would imply 2 or 3 more hooks at the toplevel... so we are going from
> your 2 initial hooks to 4 (bcs we need to hook suspend/resume), now to 6
> or 7.... 

But importantly the hooks are at a whole different layer of the code
and most likely at a completely different granularity.  You don't have
per bus hypervisor support do you?

So as I see it that is a different layer and should be treated differently.

>> There is the additional problem in all of this that our interface for
>> MSI-X to the drivers is quite likely the wrong interface.  I believe
>> we will want to incrementally allocate more irqs at run time as there
>> are work queues or the like which can be attached to them.  We can get
>> there with the current vector allocator by freeing and reallocating
>> all of the msi-x irqs when the driver wants more so the current
>> interface will suffice but it is far from optimal.
>
> Our hypervisor will not unfortunately let us do that. We can only use
> RTAS "change-msi" to allocate more/less MSIs and that is disruptive of
> the device function (we might lose pending interrupts when doing so, in
> fact, in the initial HV interface definition, we could only do that with
> the device actually disabled in the command register !).
>
> In general, I'd rather have the device pre-allocate the MSI-X it needs,
> though it can later on decide to use more or less.

That might be the right solution.  I don't know.  But that is one
among several signs that your HV interface is wrong and probably should
be fixed at the HV.  I definitely have no intentions of encouraging
another HV to emulate the brittleness of your solution.

Nor do I want to ask device drivers to preallocate 4096 interrupts
just in case they need them.  Even if batch allocation makes sense
always asking for the maximum possible that you might use is overkill.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 22:36                 ` Eric W. Biederman
@ 2007-01-28 23:17                   ` Benjamin Herrenschmidt
  2007-01-28 23:38                     ` Eric W. Biederman
  2007-01-28 23:31                   ` David Miller
  1 sibling, 1 reply; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 23:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> Because it mixes concerns that do not need to be mixed, and it complicates
> the code.  The hypervisor has no need to understand how a hardware
> device is built, and how its registers operate.  It just needs to
> know that the given hardware device will generate an msi message on
> the bus.

I'd be happy for you to go explain your view of what a hypervisor
should or should not do to IBM HV architects :-) But in the meantime,
that's how they have defined it and how it's been implemented and how we
have to support it. And I have a strong feeling that they won't be the
only ones to do it that way (I'd like to be proven wrong tho).

> Right, so some way needs to be found to cope with that situation.
> Likely that involves bypassing all of the code that talks directly to
> the hardware for MSI.

Which can be done by having the alloc() and free() hooks do all the work
provided they aren't done per-msi but per-call like in Michael's
approach. That is, in the MSI-X case, alloc is called once for all of
the MSI-X requested.

I understand that this conflicts with your idea of requesting new MSI-X
on the fly but I don't think that trying to add/remove MSI-X that way is
a sane approach anyway. If you are concerned about HW problems, I think
by doing so, you'll indeed hit them hard.

A driver who wants to modulate should really allocate all the MSI-X it
can possibly need and then enable/disable depending on its needs, I
don't trust hardware to behave properly if the stuff is reconfigured
while active.

> But importantly the hooks are at a whole different layer of the code
> and most likely at a completely different granularity.  You don't have
> per bus hypervisor support do you?
> 
> So as I see it that is a different layer and should be treated differently.

Well, I think that treating them differently will on the contrary
complicate the matter :-)

Now, as I said, I agree that Michael's current ops definition might
benefit from some changes.

I do agree for example that we might want to rework a bit what is done
in the area of the ->setup_msi_msg. An option is to remove it and
instead have the backend ->enable() hook be the one figuring out the
message and calling a low level -raw- helper rather than having a
generic raw helper hook directly in ->enable and itself then use
->setup_msi_msg as a lower level hook to get the message.

Since we need a low level raw helper to write out the message
address/data anyway (for use by set_affinity among others), by doing so,
we avoid duplication.
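
As a rough sketch of that layering, assuming made-up names throughout
(none of the sketch_* types or functions below exist in the kernel or in
Michael's patches, and the message encoding is arbitrary): the backend's
enable hook composes the message and hands it to a single raw writer,
and set_affinity reuses the same writer.

```c
#include <stdint.h>

/* Hypothetical message layout, for illustration only. */
struct sketch_msi_msg {
	uint32_t address_lo;
	uint32_t address_hi;
	uint32_t data;
};

/* Hypothetical device: "regs" stands in for the MSI registers. */
struct sketch_dev {
	struct sketch_msi_msg regs;
	int writes;	/* number of times the registers were programmed */
};

/* The one low level raw helper that touches the device. */
static void sketch_raw_write_msg(struct sketch_dev *dev,
				 const struct sketch_msi_msg *msg)
{
	dev->regs = *msg;
	dev->writes++;
}

/* Backend specific: work out the message.  The encoding is invented. */
static void sketch_compose_msg(unsigned int cpu, unsigned int vector,
			       struct sketch_msi_msg *msg)
{
	msg->address_hi = 0;
	msg->address_lo = 0x1000u + cpu * 0x10u;
	msg->data = vector;
}

/* Backend enable hook: figure out the message, then call the helper. */
int sketch_backend_enable(struct sketch_dev *dev, unsigned int vector)
{
	struct sketch_msi_msg msg;

	sketch_compose_msg(0, vector, &msg);
	sketch_raw_write_msg(dev, &msg);
	return 0;
}

/* Retargeting the irq goes through the same raw helper. */
void sketch_backend_set_affinity(struct sketch_dev *dev, unsigned int cpu)
{
	struct sketch_msi_msg msg;

	sketch_compose_msg(cpu, dev->regs.data, &msg);
	sketch_raw_write_msg(dev, &msg);
}
```

In this arrangement there is no separate setup_msi_msg hook; the only
shared piece is the raw writer.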

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 20:14           ` Benjamin Herrenschmidt
  2007-01-28 20:53             ` Eric W. Biederman
@ 2007-01-28 23:25             ` David Miller
  1 sibling, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-28 23:25 UTC (permalink / raw)
  To: benh; +Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 29 Jan 2007 07:14:52 +1100

> > Which should be good enough to handle everything but RTAS.
> 
> You keep ignoring the problem then... we -HAVE- to handle the RTAS case.
> In addition, it's not unlikely that other virtualized environment will
> provide a similar very high level APIs to MSIs.

I believe sparc64 has similar issues to RTAS fwiw.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 22:09           ` Benjamin Herrenschmidt
@ 2007-01-28 23:26             ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 23:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>  .../... (enable/disable bits)
>
>> Which are either talking directly to the hardware, or are talking
>> to the hypervisor, which is using hardware isolation so it is safe to
>> talk directly to the hardware but isn't letting us?  If we could use
>> things to work around errata in card implementation details it would
>> make some sense to me (although we don't seem to have any cards
>> that got the MSI registers wrong at this point).  Regardless these
>> operations clearly have a different granularity than the other
>> operations, and should have a different lookup method.
>
> I'm not sure I understand the point of your rant here. The hypervisor
> case hooks at alloc/free and does everything from there. It doesn't use
> an enable or a disable hook.
>
> The enable/disable ops are optional for that reason. When not present,
> it's assumed that alloc/free do it all.

Well my feeling is that in your weird HV case enable/disable should do
all of the work.  And alloc/free won't have to do anything because the
bus doesn't matter any more.
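
For reference, the optional-hook arrangement quoted above could be
sketched as follows.  This is only an illustration under assumed names
(struct sketch_msi_ops, sketch_enable_msi and the two example backends
are hypothetical, not code from either patch set): alloc/free are
always called, enable/disable only when the backend provides them.

```c
/* Hypothetical per-device state, just enough to observe the calls. */
struct sketch_state {
	int allocated;	/* number of vectors handed out */
	int enabled;	/* 1 once the device has been switched to MSI */
};

/* Hypothetical ops table; enable and disable may be left NULL. */
struct sketch_msi_ops {
	int  (*alloc)(struct sketch_state *dev, int nvec);
	void (*free)(struct sketch_state *dev);
	int  (*enable)(struct sketch_state *dev, int nvec);
	void (*disable)(struct sketch_state *dev);
};

int sketch_enable_msi(const struct sketch_msi_ops *ops,
		      struct sketch_state *dev, int nvec)
{
	int rc = ops->alloc(dev, nvec);

	if (rc)
		return rc;
	if (ops->enable) {		/* absent: alloc did it all */
		rc = ops->enable(dev, nvec);
		if (rc)
			ops->free(dev);
	}
	return rc;
}

void sketch_disable_msi(const struct sketch_msi_ops *ops,
			struct sketch_state *dev)
{
	if (ops->disable)
		ops->disable(dev);
	ops->free(dev);
}

/* HV-style backend: one call allocates and enables. */
static int hv_alloc(struct sketch_state *dev, int nvec)
{
	dev->allocated = nvec;
	dev->enabled = 1;
	return 0;
}

static void hv_free(struct sketch_state *dev)
{
	dev->enabled = 0;
	dev->allocated = 0;
}

/* Raw-style backend: allocation and enabling are separate steps. */
static int raw_alloc(struct sketch_state *dev, int nvec)
{
	dev->allocated = nvec;
	return 0;
}

static void raw_free(struct sketch_state *dev)
{
	dev->allocated = 0;
}

static int raw_enable(struct sketch_state *dev, int nvec)
{
	(void)nvec;
	dev->enabled = 1;
	return 0;
}

static void raw_disable(struct sketch_state *dev)
{
	dev->enabled = 0;
}

const struct sketch_msi_ops sketch_hv_ops = {
	.alloc = hv_alloc,
	.free = hv_free,
};

const struct sketch_msi_ops sketch_raw_ops = {
	.alloc = raw_alloc,
	.free = raw_free,
	.enable = raw_enable,
	.disable = raw_disable,
};
```

Moving the HV work from alloc/free into enable/disable, as suggested
above, would only change which two slots the HV-style backend fills in.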

> When using a "direct" approach (what we call "raw"), we expect backends
> to just plug the provided helper functions in enable/disable. It's still
> a hook so that one can do additional platform specific bits if
> necessary, but in that specific case, I do agree we could just remove it
> and move the "raw" code back into the toplevel functions, with a way
> (via a special return code from alloc maybe ?) for the HV case to tell
> us not to go through there. That was one of our initial approaches when
> working with Michael on the design.
>
> However, that sort of hurts my sense of aesthetics :-) I quite like the
> toplevel to be just a toplevel, and clearly separate the raw "helpers"
> and the backend. Provides more flexibility to handle all possible crazy
> cases in the future.

To be clear I see this as 2 distinct layers of code. enable/disable
that talks directly to the hardware, and the helpers of enable/disable
that allocate the irq.  I base this on the fact that I only need the
alloc/free when I am exclusively working with real hardware.

> You seem to absolutely want to get the HV case to go through the same
> code path as the "raw" case, and that will not happen.

Yes I do.  Because that is the only sane approach for a HV to use.
And yes we need an irq allocator to call the HV to setup the upstream
reception of the msi message.

However I don't think it will be too hard to support your HV once we get
the real hardware supported.  I just refuse to consider it before we have
figured out what makes sense in the context where we have to do everything.


>   .../... (irq operations)
>
>> These because they are per irq make sense as per bus operations unless
>> you have a good architecture definition like x86 has.  Roughly those
>> operations are what we currently have except the current operations
>> are a little simpler and easier to deal with for the architecture
>> code.
>
> Oh ? How so ? (easier/simpler ?)

I don't take a type parameter, and I don't take a vector.  All of
that work is done in the generic code.

>> And then there are the operations that are going in the wrong
>> direction.
>> +	/* setup_msi_msg - Setup an MSI message for the given device.
>> +	 *
>> +	 * @pdev:	PCI device structure.
>> +	 * @entry:	The MSI entry to create a msi_msg for.
>> +	 * @msg:	Written with the magic address and data.
>> +	 * @type:	The type, MSI or MSI-X.
>> +	 *
>> +	 * Returns the "magic address and data" used to trigger the msi.
>> +	 * If the setup is succesful this routine must return 0.
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
>> +				struct msi_msg *msg, int type);
>> 
>> Much too much of the operations based approach as proposed looks like
>> when you have a hammer every problem looks like a nail, given how much
>> confusion about what was put into the operations structure.
>
> This is indeed a lower level hook to be used by "raw" enable/disable.
> Another approach would be to remove it, have each backend have its own
> enable/disable that obtains the address/data and calls into a helper to
> program them. This would indeed be a little bit nicer in a layering
> perspective. But it adds a bit more code to each backend, so we kept
> things closer to the way they used to be. I don't have a firm reason not
> to change it however, I need to talk to Michael in case he has more good
> reasons to keep it that way around. 

The current code in the kernel already is structured that way because
we have to reprogram the msi message on each irq migration.  Not using
a helper to write the message would be a noticeable change and require
a fair amount of code rewriting on the currently supported
architectures.

>> I don't mind taking a small step and making the alloc/free primitives
>> per bus in a generic fashion. 
>>
>> I don't mind supporting poorly designed hypervisor interfaces, if it
>> is easy.
>
> And if it's not, we don't support them ? Ugh ? Well, it happens to be
> fairly easy but still, I don't understand your approach there.

Yes.  In general the mainline linux kernel does not support certain
classes of stupidity.  TCP offload engines, firmware drivers for
hardware we care about, a fixed ABI to binary only modules, etc.
It is the responsibility of the OS, not the firmware, to set up MSI,
so we do it.

Not supporting stupid things that are hard to support encourages other
people not to be so silly, especially when linux still works on the
hardware when that silly feature isn't supported.

For similar reasons we don't support more than 1 irq with a plain MSI
capability.  It is hard, we can't do it on most hardware, and anyone
who wants more than 1 irq should just implement MSI-X and everyone
will be able to use it, on any hardware.

Part of the reason to not support a messed up HV interface if it is hard
is that a HV is just software.  Which means the incremental cost to
fix it is roughly the same as fixing the linux kernel, and it puts
the burden on the people doing stupid things not on the rest of us
forever more.

>> I do strongly mind code that doesn't work, or we can't git-bisect
>> through to find where bugs were introduced.
>
> It doesn't work yet for you which is why it's not -replacing- your
> current code. Again, this was intended as arch code in the first place,
> until other archs and maintainers voiced their opinion that we should
> move that to generic code. It may not be perfect, we may still want to
> change things, maybe make some things closer to the direction you are
> taking for the x86 code, but I don't understand the root of such a
> strong opposition except maybe that you've spent time trying to fix the
> x86 junk and now are annoyed to see some of that work possibly
> replaced ?

No.  I have spent time fixing what is there, and made it work.  I see
implementations proposed that don't handle cases I have fixed, and I
don't see anything that resembles a simple migration path for i386,
x86_64 and ia64.  Which is part of what annoys me when I am told
the ops work for everything.

As for the code not working, important parts of the code (like MSI-X)
don't even work on ppc.  The strength of my opposition is largely
shaped by the number of people wearing rose colored glasses and
ignoring the problems, and missing huge details.

Given that we have been talking about things since before OLS I would
have expected the ppc code to be a little farther along.

> I agree with the problem of small changes & bisecting in the general
> case. In fact, it would be nice if we could use your fixed code with
> little change to "plug" it in as the x86 backend in many ways. Michael's
> work isn't a re-implementation of everything, it's a re-structuring,
> lots of bits of code that are missing can possibly be lifted from the
> existing working implementation.

Not the x86 backend but the raw backend.  You might not need all of
the features because you are always going through another interrupt
controller but that doesn't mean they shouldn't be there.

Michael has at least agreed to look in that direction so I'm hoping
my changes remove some of the difficulty for him.

> If we followed that "only do incremental changes" rule all the time,
> imagine in what state would be our USB stack today since we couldn't
> have dropped in Linus' replacement one ...

Well even that was a partial following of that rule because you didn't
rewrite the rest of the kernel at the same time, to better support
usb.  I do agree that there are instances where a complete rewrite is
the best path.  In this case I don't see a reasonable case for not
reusing what is there.

Nor do I see the level of care being put into the problem that would
cause me to trust a rewrite.  I have a huge number of little technical
problems with the proposed code, and see absolutely no overriding
virtue in it.  Especially when the worst of the problems with msi.c
can be easily fixed, as demonstrated by my patchset.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
@ 2007-01-28 23:26             ` Eric W. Biederman
  0 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 23:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Tony Luck, Grant Grundler, Jeff Garzik, David S. Miller,
	Greg Kroah-Hartman, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Ingo Molnar, linux-pci

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>  .../... (enable/disable bits)
>
>> Which are either talking directly to the hardware, or are talking
>> to the hypervisor, which is using hardware isolation so it is safe to
>> talk directly to the hardware but isn't leting us?  If we could use
>> things to work around errata in card implementation details it would
>> make some sense to me (although we don't seem to have any cards with
>> that got the MSI registers wrong at this point).  Regardless these
>> operations clearly have a different granularity than the other
>> operations, and should have a different lookup method.
>
> I'm not sure I undersdand the point of your rant here. The hypervisor
> case hooks at alloc/free and does everything from there. It doens't use
> an enable or a diable hook.
>
> The enable/disable ops are optional for that reason. When not present,
> it's assumed that alloc/free do it all.

Well my feeling is that in your weird HV case enable/disable should do
all of the work.  And alloc/free won't have to do anything because the
bus doesn't matter any more.

> When using a "direct" approach (what we call "raw"), we expect backends
> to just plug the provided helper functions in enable/disable. It's still
> a hook so that one can do additional platform specific bits if
> necessary, but in that specific case, I do agree we could just remove it
> and move the "raw" code back into the toplevel functions, with a way
> (via a special return code from alloc maybe ?) for the HV case to tell
> us not to go through there. That was one of our initial approaches when
> working with Michael on the design.
>
> However, that sort of hurts my sense of aestetics :-) I quite like the
> toplevel to be just a toplevel, and clearly separate the raw "helpers"
> and the backend. Provides more flexibility to handle all possible crazy
> cases in the future.

To be clear I see this as 2 distinct layers of code. enable/disable
that talks directly to the hardware, and the helpers of enable/disable
that allocate the irq.  I base this on the fact that I only need the
alloc/free when I am exclusively working with real hardware.

> You seem to absolutely want to get the HV case to go throuh the same
> code path as the "raw" case, and that will not happen.

Yes I do.  Because that is the only sane approach for a HV to use.
And yes we need an irq allocator to call the HV to setup the upstream
reception of the msi message.

However I don't think it will be to hard to support your HV once we get
the real hardware supported.  I just refuse to consider it before we have
figured out what makes sense in the context where we have to do everything.


>   .../... (irq operations)
>
>> These because they are per irq make sense as per bus operations unless
>> you have a good architecture definition like x86 has.  Roughly those
>> operations are what we currently have except the current operations
>> are a little simpler and easier to deal with for the architecture
>> code.
>
> Oh ? How so ? (easier/simpler ?)

I don't take a type parameter, and I don't take a vector.  All of
that work is done in the generic code.

>> And then there are the operations that are going in the wrong
>> direction.
>> +	/* setup_msi_msg - Setup an MSI message for the given device.
>> +	 *
>> +	 * @pdev:	PCI device structure.
>> +	 * @entry:	The MSI entry to create a msi_msg for.
>> +	 * @msg:	Written with the magic address and data.
>> +	 * @type:	The type, MSI or MSI-X.
>> +	 *
>> +	 * Returns the "magic address and data" used to trigger the msi.
>> +	 * If the setup is succesful this routine must return 0.
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
>> +				struct msi_msg *msg, int type);
>> 
>> Much to much of the operations base approach as proposed looks like
>> when you have a hammer every problem looks like a nail, given how much
>> confusion about what was put into the operations structure.
>
> This is indeed a lower level hook to be used by "raw" enable/disable. An
> other approach would be to remove it, have each backend have it's own
> enable/disable that obtains the address/data and calls into a helper to
> program them. This would indeed be a little bit nicer in a layering
> perspective. But it adds a bit more code to each backend, so we kept
> things closer to the way they used to be. I don't have a firm reason not
> to change it however, I need talk to Michael in case he has more good
> reasons to keep it that way around. 

The current code in the kernel already is structured that way because
we have to reprogram the msi message on each irq migration.  Not using
a helper to write the message would be a noticeable change and require
a fair amount of code rewriting on the currently supported
architectures.

>> I don't mind taking a small step and making the alloc/free primitives
>> per bus in a generic fashion. 
>>
>> I don't mind supporting poorly designed hypervisor interfaces, if it
>> is easy.
>
> And it it's not, we don't support them ? Ugh ? Well, it happens to be
> fairly easy but still, I don't understand your approach there.

Yes.  In general the mainline linux kernel does not support certain
classes of stupidity.  TCP offload engines, firmware drivers for
hardware we care about, a fixed ABI to binary only modules, etc.
It is the responsibility of the OS to setup MSI so we do it, not
the firmware so we do it.

Not supporting stupid things that are hard to support encourages other
people not to be so silly, especially when linux still works on the
hardware when that silly feature isn't supported.

For similar reasons we don't support more than 1 irq with a plain MSI
capability.  It is hard, we can't do it on most hardware, and anyone
who wants more than 1 irq should just implement MSI-X and everyone
will be able to use it, on any hardware.

Part of the reason to not support a messed up HV interface if it hard
is that a HV is just software.  Which means the incremental cost to
fix it is roughly the same as fixing the linux kernel, and it puts
the burden on the people doing stupid things not on the rest of us
forever more.

>> I do strongly mind code that doesn't work, or we can't git-bisect
>> through to find where bugs were introduced.
>
> It doesn't work yet for you which is why it's not -replacing- your
> current code. Again, this was intended as arch code in the first place,
> until other archs and maintainers voiced their opinion that we should
> move that to generic code. It may not be perfect, we may still want to
> change things, maybe make some things closer to the direction you are
> taking for the x86 code, but I don't understand the root of such a
> strong opposition except mayeb that you've spent time trying to fix the
> x86 junk and now are annoyed to see some of that work possibly
> replaced ?

No.  I have spent time fixing what is there, and made it work.  I see
implementations proposed that don't handle cases I have fixed, and I
don't see anything that resembles a simple migration path for i386,
x86_64 and ia64.  Which is part of what annoys me when I am told
the ops work for everything.

As for the code not working important parts of the code (like MSI-X)
don't even work on ppc.  The strength of my opposition is largely
shaped by the number of people wearing rose colored glasses and
ignoring the problems, and missing huge details.

Given that we have been talking about things since before OLS I would
have expected the ppc code to be a little farther along.

> I agree with the problem if small changes & bisecting in the general
> case. In fact, it would be nice if we could use your fixed code with
> little change to "plug" it in as the x86 backend in many ways. Michael's
> work isn't a re-implementation of everything, it's a re-structuring,
> lots of bits of code that are missing can possibly be lifted from the
> existing working implementation.

Not the x86 backend but the raw backend.  You might not need all of
the features because you are always going through another interrupt
controller but that doesn't mean they shouldn't be there.

Michael has at least agreed to look in that direction so I'm hoping
my changes remove some of the difficulty for him.

> If we followed that "only do incremental changes" rule all the time,
> imagine in what state would be our USB stack today since we couldn't
> have dropped in Linus' replacement one ...

Well even that was a partial following of that rule because you didn't
rewrite the rest of the kernel at the same time, to better support
usb.  I do agree that there are instances where a complete rewrite is
the best path.  In this case I don't see a reasonable case for not
reusing what is there.

Nor do I see the level of care being put into the problem that would
cause me to trust a rewrite.  I have a huge number of little technical
problems with the proposed code, and see absolutely no overriding
virtue in it.  Especially when the worst of the problems with msi.c
can be easily fixed, as demonstrated by my patchset.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 20:53             ` Eric W. Biederman
  2007-01-28 21:17               ` Benjamin Herrenschmidt
@ 2007-01-28 23:26               ` David Miller
  1 sibling, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-28 23:26 UTC (permalink / raw)
  To: ebiederm; +Cc: kyle, linuxppc-dev, brice, greg, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 13:53:14 -0700

> The argument that we need to support what the RTAS is doing to support
> other hypervisors seems to be a fallacy.  What the RTAS is doing is
> not sane from a hardware standpoint, so I do not expect it from other
> virtualized/hypervisor style environments. 
> 
> If the hardware provides capabilities to isolate the MSI messages
> properly it does not need to prevent us from touching the msi setup
> registers.

I disagree, I think you will find more systems, not less of them,
doing something like RTAS.  And as I stated in another email I
believe sparc64 behaves the same way as RTAS.

Account for these systems now, they exist and we need to support
them properly.  It's not a one-off kind of thing.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 22:36                 ` Eric W. Biederman
  2007-01-28 23:17                   ` Benjamin Herrenschmidt
@ 2007-01-28 23:31                   ` David Miller
  2007-01-28 23:59                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-28 23:31 UTC (permalink / raw)
  To: ebiederm; +Cc: kyle, linuxppc-dev, brice, greg, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 15:36:20 -0700

> That might be the right solution.  I don't know.  But that is one
> among several that your HV interface is wrong and probably should be
> fixed at the HV.  I definitely have no intentions of encouraging
> another HV to emulate the brittleness of your solution.
> 
> Nor do I want to ask device drivers to preallocate 4096 interrupts
> just in case they need them.  Even if batch allocation makes sense
> always asking for the maximum possible that you might use is overkill.

Other platforms do this and I think it is totally reasonable
to protect the defined PCI config register writes for MSI
and MSI-X behind the hypervisor calls.

It is one of several legitimate ways to keep a PCI device from
transmitting random junk to other devices which are behind the same
PCI controller yet belong to another virtual domain.

Another solution is to have PCI-E bridges define the unit of
splitability between virtual domains, and have those PCI-E bridges
protect each other from inter-domain MSI message writes.

Both solutions are valid, and each platform and hypervisor may make
either decision and it is reasonable.

About your argument of a MSI-Y, these platforms simply don't support
such devices and the people who design these systems and hypervisors
absolutely accept that limitation as a design trade off.  It's not a
bad thing.

Look, I have no idea where all this resistance comes from to abstract
this stuff behind enough levels to support things like RTAS et al.
properly.  Please stop it now.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:26             ` Eric W. Biederman
@ 2007-01-28 23:37               ` David Miller
  -1 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-28 23:37 UTC (permalink / raw)
  To: ebiederm
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 16:26:44 -0700

> Yes.  In general the mainline linux kernel does not support certain
> classes of stupidity.  TCP offload engines, firmware drivers for
> hardware we care about, a fixed ABI to binary only modules, etc.
> It is the responsibility of the OS to setup MSI so we do it, not
> the firmware so we do it.

I absolutely disagree with you Eric, and I think you're being
ridiculous.

If the hypervisor doesn't control the MSI PCI config space
register writes, this allows the device to spam PCI devices
which belong to other domains.

It's a freakin' reasonable design trade off decision, get over
it! :-)

Yes it can be done at the hardware level, and many hypervisor
based systems do that, but it's not the one-and-only true
way to implement inter-domain protection behind a single
PCI host controller.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:17                   ` Benjamin Herrenschmidt
@ 2007-01-28 23:38                     ` Eric W. Biederman
  2007-01-28 23:51                       ` David Miller
                                         ` (2 more replies)
  0 siblings, 3 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-28 23:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> Because it mixes concerns that do not need to be mixed, and it complicates
>> the code.  The hypervisor has no need to understand how a hardware
>> device is built, and how it's registers operate.  It just needs to
>> know that the given hardware device will generate an msi message on
>> the bus.
>
> I'd be happy for you to go explain your view of what an hypervisor
> should or should not do to IBM HV architects :-) 

Sure thing, you can set up a meeting, or give me an email introduction to them.

> But in the meantime,
> that's how they have defined it and how it's been implemented and how we
> have to support it. And I have a strong feeling that they won't be the
> only ones to do it that way (I'd like to be proven wrong tho).

No the general linux kernel does not have to support it, and if we don't
I suspect that message would get back fairly clearly to the IBM HV architects.

I haven't been watching closely but I haven't heard any rumors on the x86
side that they are looking in that direction.

>> Right, so some way needs to be found to cope with that situation.
>> Likely that involves bypassing all of the code that talks directly to
>> the hardware for MSI.
>
> Which can be done by having the alloc() and free() hooks do all the work
> provided they aren't done per-msi but per-call like in Michael's
> approach. That is, in the MSI-X case, alloc is called once for all of
> the MSI-X requested.
>
> I understand that this conflicts with your idea of requesting new MSI-X
> on the fly but I don't think that trying to add/remove MSI-X that way is
> a sane approach anyway. If you are concerned about HW problems, I think
> by doing so, you'll indeed hit them hard.

That isn't even the reason it is that way.  It is because allocating
4096 irqs in a single vector is a bad idea, and because it requires you
to pass type information of what kind of msi you are dealing with to the
lower levels in an allocation routine that makes it a bad idea.  Because
if you don't consider the IBM HV it provides no benefit and just puts
unnecessary loops, and type information in architecture code.

Face it.  Trying to make the allocation routine serve for both the
raw and the HV case unmodified is a layering violation.

>> But importantly the hooks are at a whole different layer of the code
>> and most likely at a completely different granularity.  You don't have
>> per bus hypervisor support do you?
>> 
>> So as I see it that is a different layer and should be treated differently.
>
> Well, I think that treating them differently will on the contrary
> complicate the matter :-)

It is tying unrelated concerns together in msi_ops and I am opposed to
that.

> Now, as I said, I agree that Michael's current ops definition might
> benefit from some changes.
>
> I do agree for example that we might want to rework a bit what is done
> in the area of the ->setup_msi_msg. An option is to remove it and
> instead have the backend ->enable() hook be the one figuring out the
> message and calling a low level -raw- helper rather than having a
> generic raw helper hook directly in ->enable and itself then use
> ->setup_msi_msg as a lower level hook to get the message.
>
> Since we need a low level raw helper to writeout the message
> address/data anyway (for use by set_affinity among others), by doing so,
> we avoid duplication.

Exactly.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47       ` Jeff Garzik
@ 2007-01-28 23:42         ` David Miller
  -1 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-28 23:42 UTC (permalink / raw)
  To: jeff
  Cc: benh, ebiederm, greg, tony.luck, grundler, mingo, linux-kernel,
	kyle, linuxppc-dev, brice, shaohua.li, linux-pci

From: Jeff Garzik <jeff@garzik.org>
Date: Sun, 28 Jan 2007 15:47:24 -0500

> I think the high-level ops approach makes more sense.  It's more future 
> proof, in addition to covering all existing implementations.

I totally agree with this.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20         ` Eric W. Biederman
@ 2007-01-28 23:44           ` David Miller
  -1 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-28 23:44 UTC (permalink / raw)
  To: ebiederm
  Cc: jeff, benh, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 14:20:12 -0700

> I see people pushing ridiculous interfaces like the RTAS hypervisor
> interface at me, and saying we must support running firmware drivers
> in the msi code.

This is not what's going on.

The hypervisor does the PCI config space programming on the
device to setup the MSI so that it can be done in a controlled
manner and such that the device cannot ever be configured by
one domain to shoot MSI packets over at devices which belong
to another domain.

It's that simple.

That's absolutely reasonable, and is I believe what you'll see the
sparc64 hypervisor(s) all needing as well.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:38                     ` Eric W. Biederman
@ 2007-01-28 23:51                       ` David Miller
  2007-01-29  0:58                         ` Benjamin Herrenschmidt
  2007-01-29  0:26                       ` Benjamin Herrenschmidt
  2007-01-29  0:59                       ` Michael Ellerman
  2 siblings, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-28 23:51 UTC (permalink / raw)
  To: ebiederm; +Cc: kyle, linuxppc-dev, brice, greg, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 16:38:28 -0700

> That isn't even the reason it is that way.  It is because allocating
> 4096 irqs in a single vector is a bad idea, and because it requires you
> to pass type information of what kind of msi you are dealing with to the
> lower levels in an allocation routine that makes it a bad idea.  Because
> if you don't consider the IBM HV it provides no benefit and just puts
> unnecessary loops, and type information in architecture code.

Eric, get over it, sparc64 will need this kind of abstraction
too in order to support MSI properly.

There are specific calls into the sparc64 hypervisor for MSI vs. MSI-X
configuration operations.  So a type is necessary.

Sun Niagara and IBM RTAS hypervisors are not going to get
rearchitected because you peed your pants over this on some Linux
mailing list :-)  Trust me on that one :))

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:31                   ` David Miller
@ 2007-01-28 23:59                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 23:59 UTC (permalink / raw)
  To: David Miller
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

> Look, I have no idea where all this resistance comes from to abstract
> this stuff behind enough levels to support things like RTAS et al.
> properly.  Please stop it now.

Note that to be totally fair, in some aspects (mostly simplicity gained
from not handling the RTAS-type setup), Eric's code is nicer than our
proposal.

What annoys me is that Eric wants to completely separate the handling of
RTAS-type via a separate abstraction than the "classic" case.

The main thing here is that with Eric's code, the backend really only
cares about one interrupt at a time, via the alloc/free hook, and thus
can totally ignore whether it's an MSI or one of multiple MSI-X (or even
one of multiple MSIs if we ever support that).

Michael's code makes it a little bit less transparent... alloc() /
free() has to operate on a level that matches the HV interfaces, thus
are called for either a single MSI or a set of MSI-X, though we made
that interface nice enough so we really only deal with an array and a
count (with the count being 1 for a single MSI).

One thing we could do, is remove our enable/disable hooks. The
functionality can be kept into the core, as is with Eric's code,
provided we have a way for alloc/free to say "job done, nothing else
needed", via either a special result code or maybe an ops "member"
variable set to 1 statically in the definition of the RTAS ops.
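
A minimal sketch of the second option (a static flag in the ops), just
to make the idea concrete. Everything here is hypothetical: the struct
layout, the setup_done field, and msi_core_enable() are illustrative
names and are not taken from either patch set.

```c
/* Hypothetical ops structure; field names are illustrative only. */
struct msi_ops_sketch {
	int  (*alloc_irqs)(int num, int *irqs);	/* 0 or negative error */
	void (*free_irqs)(int num, int *irqs);
	int  setup_done;	/* 1: alloc_irqs() already configured the device */
};

/* Counts how often the core had to program a message itself. */
static int core_writes;

/* Stand-in for the core writing address/data to the device. */
static void core_write_msg(int irq)
{
	(void)irq;
	core_writes++;
}

/* Core enable path: allocate, then program only if the backend did not. */
static int msi_core_enable(const struct msi_ops_sketch *ops, int num, int *irqs)
{
	int i, ret;

	ret = ops->alloc_irqs(num, irqs);
	if (ret < 0)
		return ret;
	if (ops->setup_done)
		return 0;	/* RTAS-like backend: nothing else needed */
	for (i = 0; i < num; i++)
		core_write_msg(irqs[i]);
	return 0;
}

/* Toy allocator shared by both example backends. */
static int toy_alloc(int num, int *irqs)
{
	int i;

	for (i = 0; i < num; i++)
		irqs[i] = 16 + i;
	return 0;
}

static void toy_free(int num, int *irqs)
{
	(void)num;
	(void)irqs;
}

/* "Raw" backend: the core still has to write the messages. */
static const struct msi_ops_sketch raw_ops = { toy_alloc, toy_free, 0 };
/* RTAS-like backend: alloc did everything, flag set statically. */
static const struct msi_ops_sketch hv_ops  = { toy_alloc, toy_free, 1 };
```

With a flag like this the core keeps a single enable path, and the
RTAS-type backend opts out of the device programming without needing
its own enable/disable hooks.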

Another thing is we still need to have the addr/data returned for the
non-RTAS case. Eric doesn't like the setup_msi_msg() callback through
the ops because it operates at a different layer than alloc/free. The
option there would be to have alloc/free return the setup infos and
store them in the msi data on platforms where that is useful.

At this point I don't really have a firm preference of either taking
Michael's code and changing it in some areas to please Eric or try to
evolve from Eric's code, though I do feel that the latter would still
have strong resistance in the area where alloc/free are concerned, that
is the whole idea of allocating the whole set at once or per-MSI, the
latter being unsuitable for RTAS-like implementations.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:38                     ` Eric W. Biederman
  2007-01-28 23:51                       ` David Miller
@ 2007-01-29  0:26                       ` Benjamin Herrenschmidt
  2007-01-29  0:59                       ` Michael Ellerman
  2 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  0:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> That isn't even the reason it is that way.  It is because allocating
> 4096 irqs in a single vector is a bad idea

Why ?

> and because it requires you to pass type information of what kind of
> msi you are dealing with to the lower levels in an allocation routine 
> that makes it a bad idea.
> Because if you don't consider the IBM HV it provides no benefit and
> just puts unnecessary loops, and type information in architecture 
> code.

The only difference in practice is loop vs. no loop in fact. That is

alloc_irq (one MSI) for your version and
alloc_irqs (an array of MSIs) for our version.

Sure, in the latter we also pass the type, but you don't use it for the
"raw" case, we only use it for the hypervisor case.

The difference is that one version (yours) cannot handle the HV case
while the other can. Makes a big difference to me.
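
To make the "loop vs. no loop" point concrete, here is a rough sketch
of a per-interrupt allocator wrapped by an array-style one. All names
(the enum, the toy pool, alloc_irq/alloc_irqs/release_irq as written
here) are made up for illustration and do not match the real code in
either series.

```c
/* Illustrative type tag; the raw backend below ignores it. */
enum msi_type_sketch { SKETCH_MSI, SKETCH_MSIX };

/* Toy pool of four irq numbers starting at 32. */
#define POOL_BASE 32
#define POOL_SIZE 4
static unsigned char pool_used[POOL_SIZE];

/* Per-interrupt version: hands out one irq, or -1 if the pool is empty. */
static int alloc_irq(void)
{
	int i;

	for (i = 0; i < POOL_SIZE; i++) {
		if (!pool_used[i]) {
			pool_used[i] = 1;
			return POOL_BASE + i;
		}
	}
	return -1;
}

static void release_irq(int irq)
{
	pool_used[irq - POOL_BASE] = 0;
}

/*
 * Array version: same work in a loop.  A hypervisor backend could
 * instead make one call for the whole set and look at 'type'.
 */
static int alloc_irqs(enum msi_type_sketch type, int num, int *irqs)
{
	int i;

	(void)type;	/* unused in the raw case */
	for (i = 0; i < num; i++) {
		irqs[i] = alloc_irq();
		if (irqs[i] < 0) {
			/* Unwind what we already took. */
			while (--i >= 0)
				release_irq(irqs[i]);
			return -1;
		}
	}
	return 0;
}
```

The raw backend pays for one loop and an ignored argument; the array
form is the only one of the two that a "whole set in one call" backend
can implement.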

> Face it.  Trying to make the allocation routine serve for both the
> raw and the HV case unmodified is a layering violation.

So you want two separate abstractions and I think that's gross.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:51                       ` David Miller
@ 2007-01-29  0:58                         ` Benjamin Herrenschmidt
  2007-01-29  1:13                           ` David Miller
  2007-01-31  6:52                           ` David Miller
  0 siblings, 2 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  0:58 UTC (permalink / raw)
  To: David Miller
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm


> There are specific calls into the sparc64 hypervisor for MSI vs. MSI-X
> configuration operations.  So a type is necessary.

BTW. Do you have some pointers to documentation on those sparc64
interfaces ? I'd like to have a look as we might still try to change
some of our approach to match some of Eric's wishes, I want to make
sure I'm not going somewhere that will not work for sparc...

For example, I'd like to know if sparc64 HV is indeed like IBM, that is
a single HV call does the complete setup, or if you still have some
level of manual config space access to do.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-28 23:38                     ` Eric W. Biederman
  2007-01-28 23:51                       ` David Miller
  2007-01-29  0:26                       ` Benjamin Herrenschmidt
@ 2007-01-29  0:59                       ` Michael Ellerman
  2 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-29  0:59 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, Greg Kroah-Hartman,
	shaohua.li, linux-pci, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 4123 bytes --]

On Sun, 2007-01-28 at 16:38 -0700, Eric W. Biederman wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> 
> >> Because it mixes concerns that do not need to be mixed, and it complicates
> >> the code.  The hypervisor has no need to understand how a hardware
> >> device is built, and how it's registers operate.  It just needs to
> >> know that the given hardware device will generate an msi message on
> >> the bus.
> >
> > I'd be happy for you to go explain your view of what an hypervisor
> > should or should not do to IBM HV architects :-) 
> 
> Sure think you can setup a meeting, or give me an email introduction to them.
> 
> > But in the meantime,
> > that's how they have defined it and how it's been implemented and how we
> > have to support it. And I have a strong feeling that they won't be the
> > only ones to do it that way (I'd like to be proven wrong tho).
> 
> No the general linux kernel does not have to support it, and if we don't
> I suspect that message would get back fairly clearly to the IBM HV architects.

OT, but: No, it wouldn't. It would just play into the hands of people
who think Linux is immature, unpredictable and risky. IBM has another
UNIX remember.

> I haven't been watching closely but I haven't heard any rumors on the x86
> side that they are looking in that direction.
> 
> >> Right, so some way needs to be found to cope with that situation.
> >> Likely that involves bypassing all of the code that talks directly to
> >> the hardware for MSI.
> >
> > Which can be done by having the alloc() and free() hooks do all the work
> > provide they aren't done per-msi but per-call like in Michael's
> > approach. That is, in the MSI-X case, alloc is called once for all of
> > the MSI-X requested.
> >
> > I understand that this conflicts with your idea of requesting new MSI-X
> > on the fly but I don't think that trying to add/remove MSI-X that way is
> > a sane approach anyway. If you are concerned about HW problems, I think
> > by doing so, you'll indeed hit them hard.
> 
> That isn't even the reason it is that way.  It is because allocating
> 4096 irqs in a single vector is a bad idea, and because it requires you
> to pass type information of what kind of msi you are dealing with to the
> lower levels in an allocation routine that makes it a bad idea.  Because
> if you don't consider the IBM HV it provides no benefit and just puts
> unnecessary loops, and type information in architecture code.

I'm not sure what the issue with 4096 irqs is.

As far as passing type information to the alloc routine, it's only there
_if_ the alloc routine needs it.

If you'd prefer we could not pass the type explicitly to the alloc
routine, but rather just have it sitting in the msi_info ... which is
exactly what the current code does, the type is stored in the msi_desc.

What we could do is move the msi_msg into the msix_entry struct, then we
could do alloc like below and remove the need for setup_msi_msg:

int arch_setup_msi_irqs(struct pci_dev *pdev, int num,
                        struct msix_entry *entries)
{
        int i;

        for (i = 0; i < num; i++) {
                int irq, ret;

                irq = create_irq();
                if (irq < 0)
                        return irq;

                /* desc: the msi_desc for this entry, lookup elided here */
                set_irq_msi(irq, desc);
                ret = msi_compose_msg(pdev, irq, &entries[i].msg);
                if (ret < 0) {
                        destroy_irq(irq);
                        return ret;
                }

                /* Save the vector; the msg was stored by msi_compose_msg() */
                entries[i].vector = irq;

                set_irq_chip_and_handler_name(irq, &msi_chip,
                                              handle_edge_irq, "edge");
        }

        return 0;
}

Which is almost exactly the same as the current code, except it's inside
a for loop - and it saves the msg and vector to be written later by the
enable hook - which will basically be write_msi_msg() in a loop.
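
Roughly, and only as a sketch, the enable side could then look like
the following. The struct names and write_msg_sketch() are stand-ins
invented here (the real msix_entry does not carry a msg today, that is
the proposed change), and the "write" just records what would have
been programmed into the device.

```c
/* Stand-in for msi_msg: address and data to program into the device. */
struct msg_sketch {
	unsigned int address_lo;
	unsigned int address_hi;
	unsigned int data;
};

/* Stand-in for an msix_entry that also carries its composed message. */
struct entry_sketch {
	int vector;
	struct msg_sketch msg;
};

/* Log of "device writes" so the effect of enable can be inspected. */
#define MAX_WRITES 8
static struct {
	int irq;
	unsigned int data;
} write_log[MAX_WRITES];
static int n_writes;

/* Stand-in for write_msi_msg(): record instead of touching hardware. */
static void write_msg_sketch(int irq, const struct msg_sketch *msg)
{
	if (n_writes < MAX_WRITES) {
		write_log[n_writes].irq = irq;
		write_log[n_writes].data = msg->data;
		n_writes++;
	}
}

/* Enable hook: write out every message saved by the alloc step. */
static void enable_sketch(int num, const struct entry_sketch *entries)
{
	int i;

	for (i = 0; i < num; i++)
		write_msg_sketch(entries[i].vector, &entries[i].msg);
}
```

A backend where firmware does the programming would simply not provide
this step, since there is no message for the kernel to write.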

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  0:58                         ` Benjamin Herrenschmidt
@ 2007-01-29  1:13                           ` David Miller
  2007-01-29  3:17                             ` Benjamin Herrenschmidt
  2007-01-29  5:46                             ` Eric W. Biederman
  2007-01-31  6:52                           ` David Miller
  1 sibling, 2 replies; 178+ messages in thread
From: David Miller @ 2007-01-29  1:13 UTC (permalink / raw)
  To: benh; +Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 29 Jan 2007 11:58:21 +1100

> 
> > There are specific calls into the sparc64 hypervisor for MSI vs. MSI-X
> > configuration operations.  So a type is necessary.
> 
> BTW. Do you have some pointers to documentation on those sparc64
> interfaces ? I'd like to have a look as we might still try to change
> some of our approach to match some of Eric's wishes, I want to make
> sure I'm not going somewhere that will not work for sparc...
> 
> For example, I'd like to know if sparc64 HV is indeed like IBM, that is
> a single HV call does the complete setup, or if you still have some
> level of manual config space access to do.

I just started reading those docs right now in fact :-)

The sparc64 hypervisor manual is at:

	http://opensparc-t2.sunsource.net/index.html

Click on "UltraSPARC T1 Hypervisor API Specification" near
the bottom of the page.  The MSI bits are in section 21.4 on
page 105.

BTW, I like how Benjamin is being constructive by expressing
interest in how sparc64's hypervisor works instead of Eric's
seeming non-interest in how or why RTAS or sparc64 work the
way that they do :-)

That being said, it looks like the hypervisor calls just setup
the MSI config inside of the PCI host controller, you still have
to do the PCI config space writes.  So in this regard it's not
like RTAS.

The PCI controller defines a 32-bit and a 64-bit MSI address range
the PCI controller will respond to as MSI.  Then there are queues
where received MSI interrupt information is stored, you subsequently
associate a MSI (which they call "msi") or a MSI-X (which they call
a "msg") with one of these queues.  Each queue generates a unique
interrupt to the system, and therefore this is the granularity at
which CPU targeting is done.

All of this stuff is defined via various OFW properties in the PCI
controller root bus node.

Example:

    Node 0xf02762a0
        .node:  f02762a0
        available:  81000000.00000000.00000000.00000000.00010000.82000000.00000000.00120000.00000000.000e0000.82000000.00000000.00300000.00000000.7fcf0000
        reg:  c0000780.00000000.00000000.00000000
        ranges:  01000000.00000000.00000000.000000e8.10000000.00000000.10000000.02000000.00000000.00000000.000000ea.00000000.00000000.7fff0000.03000000.00000000.00000000.000000ec.00000000.00000003.ffff0000
        msi-eq-to-devino:  00000000.00000024.00000018
        #msi-eqs:  00000024
        msix-data-width:  00000020
        msi-eq-size:  00000080
        msi-ranges:  00000000.00000100
        msi-data-mask:  000000ff
        #msi:  00000100
        msi-address-ranges:  00000000.7fff0000.00010000.00000003.ffff0000.00010000
        bus-range:  00000002.00000007
        no-probe-list: '0'
        bus-parity-generated:  
        #address-cells:  00000003
        #size-cells:  00000002
        name: 'pci'
        compatible: 'SUNW,sun4v-pci'
        device_type: 'pciex'
        virtual-dma:  80000000.80000000
        interrupt-map:  00020000.00000000.00000000.00000001.f02762a0.00000014.00020000.00000000.00000000.00000002.f02762a0.00000015.00020000.00000000.00000000.00000003.f02762a0.00000016.00020000.00000000.00000000.00000004.f02762a0.00000017
        interrupt-map-mask:  00fff000.00000000.00000000.00000007
        #interrupt-cells:  00000001
        interrupts:  0000003f.0000003e

The "devino" is the "system interrupt", each of which you can
enable/disable/cpu-target.  The above states that there are
36 MSI queues, to which msis and msgs can be associated.
So this would encompass devinos 0x18 --> 0x18 + 0x24 because
the msi-eq-to-devino property specifies the triplet
"first msiQ, num msiQs, first devino".
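
As a worked example of that triplet, here is a small sketch of the
queue to devino mapping, assuming the property really is a simple
linear range as described above. The struct and function names are
made up for illustration; with the values from the node dump
(0, 0x24, 0x18) queue 0 lands on devino 0x18 and the last queue,
number 35, on devino 0x3b.

```c
/* Decoded "msi-eq-to-devino" triplet; field names are illustrative. */
struct msi_eq_map {
	unsigned int first_eq;		/* first MSI queue number */
	unsigned int num_eqs;		/* number of MSI queues */
	unsigned int first_devino;	/* devino of the first queue */
};

/*
 * Map an MSI queue to its devino, assuming a linear range.
 * Returns -1 if the queue is outside the advertised range.
 */
static int eq_to_devino(const struct msi_eq_map *m, unsigned int eq)
{
	if (eq < m->first_eq || eq - m->first_eq >= m->num_eqs)
		return -1;
	return (int)(m->first_devino + (eq - m->first_eq));
}
```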

I'm working out all of this stuff myself and will try to cut as much
of an implementation as possible over the next few evenings.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:26             ` Eric W. Biederman
@ 2007-01-29  1:33               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  1:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller


> To be clear I see this as 2 distinct layers of code. enable/disable
> that talks directly to the hardware, and the helpers of enable/disable
> that allocate the irq.  I base this on the fact that I only need the
> alloc/free when I am exclusively working with real hardware.

We need the alloc/free in all cases, whether we are talking to real HW or
hypervisor. Alloc/free is what allocates linux virtual irq numbers (or
irq_desc's if you prefer) and what sets up the irq_desc->irq_chip to
the appropriate thing for MSIs on that machine. Thus it's really the
required step for everybody.

The thing you seem to be mixing up is allocating linux virtual irqs
(picking an irq desc) and allocating HW vectors on your platform
(which happen to be pretty much the same on x86 nowadays, no ? That is,
they have the same numbering domain don't they ?).

That is, while in the HV case, we don't allocate HW vectors (the HV does
it for us), we still need to allocate linux irqs, setup the irq desc,
and hook them up.

> > You seem to absolutely want to get the HV case to go throuh the same
> > code path as the "raw" case, and that will not happen.
> 
> Yes I do.  Because that is the only sane approach for a HV to use.

BUT WE DON'T HAVE A CHOICE ON WHAT APPROACH THE HV USES !!!! pfff...
Isn't that clear enough ?

IBM will not change their HV interfaces because you don't like them, and
I doubt Sun will either, and despite you disagreeing on that, we -do-
have to support them (hell, that's what I'm paid for among other
things :-)

It would be nice if we could dictate to all HV and hardware vendors
how we think they should work and what interfaces they should provide. I
suppose M$ can do that with Windows, but unfortunately, we aren't in a
position to do that.
 
> And yes we need an irq allocator to call the HV to setup the upstream
> reception of the msi message.

Not sure I completely parse the above.

> However I don't think it will be to hard to support your HV once we get
> the real hardware supported.  I just refuse to consider it before we have
> figured out what makes sense in the context where we have to do everything.

Hrmph....

> >   .../... (irq operations)
> >
> >> These because they are per irq make sense as per bus operations unless
> >> you have a good architecture definition like x86 has.  Roughly those
> >> operations are what we currently have except the current operations
> >> are a little simpler and easier to deal with for the architecture
> >> code.
> >
> > Oh ? How so ? (easier/simpler ?)
> 
> I don't take a type parameter, and I don't take a vector.  All of
> that work is done in the generic code.

Well, so basically, the main difference is that we make MSI look like
MSI-X by providing an alloc/free abstraction that takes an array in all
cases and you make MSI-X look like MSI by working one interrupt at a
time.

Your case avoids a for () loop in the backend, at the cost of being
fundamentally incompatible with our HV approach (and possibly others
like sparc64).

We do pass the MSI vs. MSI-X because it's handy for the HV case to pass
it along to the firmware, though it doesn't have to be used, and indeed,
in the "raw" case, we don't use it.
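
As a rough illustration of the two shapes being compared, here is a
minimal C sketch. All names (msi_ops, raw_alloc, hv_alloc and so on) are
hypothetical and the firmware call is only simulated with a counter; it
shows the idea, not either patchset.

```c
#include <assert.h>
#include <stddef.h>

enum msi_type { TYPE_MSI, TYPE_MSIX };  /* hypothetical names */

struct msi_entry {
	int index;  /* MSI-X table index, 0 for plain MSI */
	int virq;   /* Linux virtual irq, filled in by the backend */
};

/* Array-style hook: the backend sees the whole request at once. */
struct msi_ops {
	int (*alloc)(enum msi_type type, struct msi_entry *e, int n);
};

static int next_virq = 16;  /* toy allocator state */
static int hv_calls;        /* counts simulated firmware calls */

/* Per-interrupt helper, as a "raw" backend might use internally. */
static int raw_alloc_one(struct msi_entry *e)
{
	e->virq = next_virq++;
	return 0;
}

/* Raw backend: the array interface costs one for () loop. */
static int raw_alloc(enum msi_type type, struct msi_entry *e, int n)
{
	(void)type;  /* not needed in the raw case */
	assert(e != NULL && n > 0);
	for (int i = 0; i < n; i++)
		if (raw_alloc_one(&e[i]))
			return -1;
	return 0;
}

/* Firmware-driven backend: one call covers the whole set. */
static int hv_alloc(enum msi_type type, struct msi_entry *e, int n)
{
	(void)type;  /* a real backend would pass this to firmware */
	assert(e != NULL && n > 0);
	hv_calls++;  /* stands in for a single firmware request */
	for (int i = 0; i < n; i++)
		e[i].virq = 100 + i;
	return 0;
}

static const struct msi_ops raw_ops = { .alloc = raw_alloc };
static const struct msi_ops hv_ops  = { .alloc = hv_alloc };
```

The point of the sketch is that a per-interrupt helper can always be
wrapped in a loop to satisfy the array interface, whereas a backend that
must hand the whole set to firmware in one request has nothing to hook
onto if it is only ever called one interrupt at a time.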
 
> > This is indeed a lower level hook to be used by "raw" enable/disable. An
> > other approach would be to remove it, have each backend have it's own
> > enable/disable that obtains the address/data and calls into a helper to
> > program them. This would indeed be a little bit nicer in a layering
> > perspective. But it adds a bit more code to each backend, so we kept
> > things closer to the way they used to be. I don't have a firm reason not
> > to change it however, I need talk to Michael in case he has more good
> > reasons to keep it that way around. 
> 
> The current code in the kernel already is structured that way because
> we have to reprogram the msi message on each irq migration.  Not using
> a helper to write the message would be a noticeable change and require
> a fair amount of code rewriting on the currently supported
> architectures.

We never proposed not to use a helper to write back the message. We are
missing such a helper in the current implementation, true, but that
doesn't mean we are opposed to having it, on the contrary.

However, I don't think your implementation is much cleaner :-) The thing
is that Michael's implementation completely avoids having any knowledge
of the specifics of enabling/disabling MSI's or MSI-X's in the top level
core code.

The main difference after the alloc/free case is the enable/disable
case:

You do something like this: Toplevel calls the backend "setup" for each
MSI or MSI-X, which itself then calls back into a helper to actually
write the message, that helper then doing a switch/case on MSI vs. MSI-X
based on info in the msi desc. Then, you go back to the toplevel which
goes and whacks the config space to actually do the global enabling of
MSIs or MSI-X.

Well, I don't think that from a layering perspective, that is much
nicer. Your toplevel is a mix of high level interface to the backend and
low level code specific to the "raw" implementation.
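
The call flow described above can be sketched in a few lines of C. This
is a toy model under stated assumptions: struct toy_dev stands in for
config space and the MSI-X table, every function name is hypothetical,
and the address/data values are arbitrary placeholders.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum msi_type { TYPE_MSI, TYPE_MSIX };  /* hypothetical names */

struct msi_msg {
	uint32_t addr;
	uint32_t data;
};

/* Toy device: stands in for config space and the MSI-X table. */
struct toy_dev {
	enum msi_type type;
	struct msi_msg cfg_msg;        /* MSI: message lives in config space */
	struct msi_msg msix_table[4];  /* MSI-X: one message per table entry */
	int enabled;                   /* global MSI / MSI-X enable bit */
};

/* Helper: writes one message, switching on MSI vs. MSI-X. */
static void write_msg(struct toy_dev *d, int entry, const struct msi_msg *m)
{
	switch (d->type) {
	case TYPE_MSI:
		d->cfg_msg = *m;
		break;
	case TYPE_MSIX:
		d->msix_table[entry] = *m;
		break;
	}
}

/* Backend "setup": composes address/data, then calls back into the helper. */
static int backend_setup(struct toy_dev *d, int entry)
{
	struct msi_msg m = { 0xfee00000u, 0x4000u + (uint32_t)entry };

	write_msg(d, entry, &m);
	return 0;
}

/* Toplevel: per-interrupt setup first, then the global enable. */
static int toplevel_enable(struct toy_dev *d, int nvec)
{
	assert(d != NULL && nvec >= 1 && nvec <= 4);
	for (int i = 0; i < nvec; i++)
		if (backend_setup(d, i))
			return -1;
	d->enabled = 1;  /* the "whack the config space" step */
	return 0;
}
```

Note how toplevel_enable() both drives the backend and touches the
device directly, which is the mix of layers being objected to.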

In fact, I preferred the way it was done previously in that area in the
sense that if you decide to have the "raw" implementation indeed be the
"default" one, then move it to the top level and call some hook to
obtain the address/value pair for each MSI. That doesn't preclude having
the low level write_msi_message() function still be exported for use by
the set_affinity callback.

Michael's approach is similar to the above except that instead of
having the raw implementation at the toplevel, it hooks it in via
enable/disable/setup_msi_msg.
 
> Yes.  In general the mainline linux kernel does not support certain
> classes of stupidity.
> TCP offload engines, firmware drivers for
> hardware we care about, a fixed ABI to binary only modules, etc.
> It is the responsibility of the OS to setup MSI so we do it, not
> the firmware so we do it.
> 
> Not supporting stupid things that are hard to support encourages other
> people not to be so silly, especially when linux still works on the
> hardware when that silly feature isn't supported.

Not supporting IBM HV because of those idealistic reasons means not
supporting a whole range of IBM machines in linux since LSIs are
optional on PCI-E.

It's not just a performance difference. A whole set of hardware will
-not- work on those machines because somebody has an ideal view of the
world (heh, that's funny, that same person actively works on x86
support, damn, that's something less than ideal :-)

I think that's a bit of a lame attitude (without wanting to be
insulting). The same way, we can't dictate to HW vendors how to do their
stuff (we try to encourage them, we try to teach them, but once the HW
is out and people use it, we do also try to actually support it). So
yes, we try to "fix" some of our HV stuff when we think it's too much
off the hook (for example, initial interfaces didn't allow us to
differentiate MSI from MSI-X at all, we got that changed) but there's a
limit on our influence on these things (heh, they also have to support
other operating systems) and we can't just say "won't support you" when
the interfaces don't please us.

> For similar reasons we don't support more than 1 irq with a plain MSI
> capability.  

I never understood why we had this stupid limitation in our API. It
would have been easy enough to do an API that can support it, as long as
we properly define that the platform is allowed to give you less than
what you asked.
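
A minimal sketch of what "allowed to give you less than what you asked"
could look like, in C. The function name and the platform_limit
parameter are hypothetical; the only hardware fact relied on is that
plain MSI supports power-of-two vector counts up to 32.

```c
#include <assert.h>

/*
 * Hypothetical request helper: ask for 'want' vectors and return how
 * many were granted (possibly fewer), or -1 if none are available.
 * Plain MSI only allows power-of-two counts up to 32.
 */
static int toy_request_msi(int want, int platform_limit)
{
	int granted = 1;

	assert(want >= 1);
	if (platform_limit < 1)
		return -1;
	while (granted * 2 <= want && granted * 2 <= platform_limit &&
	       granted * 2 <= 32)
		granted *= 2;
	return granted;
}
```

A driver written against such an interface checks the return value and
falls back to sharing one vector when it gets fewer than it wanted.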

> It is hard

Not really... Heh, in fact, with those "stupid" hypervisor interfaces,
it's actually very easy :-) But even in the raw case. Really not that
hard. Easier than MSI-X in many ways.

> we can't do it on most hardware

I've seen quite a few cards that say they do more than 1 MSI and the host
hardware shouldn't matter in that area.

> and anyone who wants more than 1 irq should just implement MSI-X and everyone
> will be able to use it, on any hardware.

Sure, anyone should just implement their hardware the way the linux
folks tell them to do, too bad HW vendors don't worship us as gods and
don't take our rules as god-sent laws ...

> Part of the reason to not support a messed up HV interface if it hard
> is that a HV is just software.  Which means the incremental cost to
> fix it is roughly the same as fixing the linux kernel, and it puts
> the burden on the people doing stupid things not on the rest of us
> forever more.

The comparison between > 1 MSI and HV is bad here. Supporting only 1 MSI
actually still allows the HW to work. Not supporting the HV (and thus
not supporting MSIs on those machines) does not when you start hitting
hardware that doesn't do LSI (which is allowed by spec and is starting
to appear, some IB cards for example don't do LSI).

> No.  I have spent time fixing what is there, and made it work.  I see
> implementations proposed that don't handle cases I have fixed, and I
> don't see anything that resembles a simple migration path for i386,
> x86_64 and ia64.  Which is part of what annoys me when I am told
> the ops work for everything.

They potentially do, and the easy migration path is mostly to use the
existing code as a HV-type backend for x86, and then incrementally fix
our generic "raw" helpers and move bits of the x86 / old-generic-code to
it... 

That's also a nice incremental approach...
 
> As for the code not working important parts of the code (like MSI-X)
> don't even work on ppc. 
> The strength of my opposition is largely
> shaped by the number of people wearing rose colored glasses and
> ignoring the problems, and missing huge details.

Well, which is why I'd like to have a more constructive discussion on
how to address those rather than outright dismissing the approach. You
are using the fact that Michael's implementation isn't feature complete
as an argument to dismiss the entire approach. In a way, you are
committing a layering violation in your argument :-)

However, we can't get that resolved if we still don't agree on the very
basic premises of the direction we are taking. We need to agree on some
of the fundamentals (like having alloc/free take an array or be
per-interrupt) or agree to disagree, in which case we have no choice on
our side but to "finish" Michael's implementation to do all those bits
it's missing and propose it as an alternative since the main one will not
handle our needs.
 
> Given that we have been talking about things since before OLS I would
> have expected the ppc code to be a little farther along.

We have been delayed / side tracked with other things. Shit happens.

> Not the x86 backend but the raw backend.  You might not need all of
> the features because you are always going through another interrupt
> controller but that doesn't mean they shouldn't be there.

I never disagreed with that. I always said we should have most of the
missing bits added to the raw backend for x86.

> Michael has at least agreed to look in that direction so I'm hoping
> my changes remove some of the difficulty for him.

Some do, some make it more difficult. The way you removed the
alloc/free vs. setup/teardown distinction and made the whole thing
per-interrupt makes things more difficult for us.

> Nor do I see the level of care being put into the problem that would
> cause me to trust a rewrite.  I have a huge number of little technical
> problems with the proposed code, and see absolutely no overriding
> virtue in it.  Especially when the worst of the problems with msi.c
> can be easily fixed, as demonstrated by my patchset.

Can be fixed in a way that is by design incompatible with what we need. 

How should I phrase this so you understand that's not an option for
us ? 

Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  1:13                           ` David Miller
@ 2007-01-29  3:17                             ` Benjamin Herrenschmidt
  2007-01-29  4:19                               ` David Miller
  2007-01-29  5:46                             ` Eric W. Biederman
  1 sibling, 1 reply; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  3:17 UTC (permalink / raw)
  To: David Miller
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm


> That being said, it looks like the hypervisor calls just setup
> the MSI config inside of the PCI host controller, you still have
> to do the PCI config space writes.  So in this regard it's not
> like RTAS.

Ok, between the spec and your email, I think I sort-of understand it :-)
(damn, the Sun spec is a bit obscure...). 

So basically, from a kernel MSI backend perspective, I think you would
mostly use the "raw" implementation and locally implement your own
vector allocation.

However, your vector space is per-bus (which is good), so you do need to
allocate linux virtual irqs and map them to the actual MSI vectors like
we do on powerpc.
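
A small C sketch of that mapping, assuming a fixed-size table per bus
and a single global virq counter. All names and sizes are hypothetical;
it only shows why the same hardware vector number on two buses has to
end up on two different linux virtual irqs.

```c
#include <assert.h>
#include <stddef.h>

#define VECS_PER_BUS 8  /* arbitrary size for the sketch */
#define NO_VIRQ      0  /* marks an unmapped vector */

/* Each bus has its own MSI vector space, so each gets its own table. */
struct toy_bus {
	int virq_of[VECS_PER_BUS];  /* hw vector -> Linux virtual irq */
};

static int next_virq = 32;  /* toy global virq allocator */

/* Map a per-bus hardware vector to a virq, allocating one on first use. */
static int toy_map_vector(struct toy_bus *bus, int hwvec)
{
	assert(bus != NULL);
	if (hwvec < 0 || hwvec >= VECS_PER_BUS)
		return -1;
	if (bus->virq_of[hwvec] == NO_VIRQ)
		bus->virq_of[hwvec] = next_virq++;
	return bus->virq_of[hwvec];
}
```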

I think Eric's framework would work for you. As long as you don't need
to do something special for MSI-X, which I don't think you do...

Of course, Michael's stuff would work too, though it needs some
additions as you probably need to use config space (or MSI-X MMIO) for
masking & unmasking which we haven't implemented yet.

You are probably better off starting from Eric's stuff with his latest
patches I suppose...

At this point, I feel like Eric and us will not find a common ground,
which leaves us with these options:

 - Just give up and keep our current powerpc hooks at the toplevel. That
is, powerpc does its own pci_enable_msi/x etc... (we need to fix those
hooks a bit but basically that's the idea). Internally, those go through
function pointers on which the RTAS implementation hooks directly, and
for non-RTAS powerpc archs, those point back to Eric's code which is
usable for these. In addition, I still want to have Eric's two "arch"
callbacks be themselves ops derived from the PCI device but that too can
be done in arch specific ways.

 - Give up in a different way and on powerpc, use Michael's
infrastructure and not use Eric's code at all (that means moving
Michael's stuff back to arch/powerpc which was Greg's original objection
to it).

 - Try to force our stuff in by implementing x86 completely (and Altix)
under Michael's infrastructure and then try to convince
Andrew/Greg/Linus to take it. Fairly unlikely. We do have a somewhat
"gradual" approach to it which consists of having Michael's code at the
toplevel, Eric's code hooked in as if it was a hypervisor, and then
gradually "merge" the raw backend with the x86 code, but it doesn't seem
very sexy (to me neither).

The main problem that I see that prevents us from an approach where
either we fix Michael's code to please Eric or change Eric's code to fit
our needs is that the way Eric's code is evolving (based on his latest
patches), it's moving in a direction that is fundamentally unusable
for our RTAS backend.

So unless Eric agrees to change his mind on that issue, we simply cannot
find a common abstraction. Which means that the only way we'll ever be
able to implement RTAS is by having separate hooks above Eric's code.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  3:17                             ` Benjamin Herrenschmidt
@ 2007-01-29  4:19                               ` David Miller
  2007-01-29  4:44                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-29  4:19 UTC (permalink / raw)
  To: benh; +Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 29 Jan 2007 14:17:02 +1100

> However, your vector space is per-bus (which is good), so you do need to
> allocate linux virtual irqs and map them to the actual MSI vectors like
> we do on powerpc.

Yes, I already use virtual irqs on sparc64 so it'll be easy to
implement.

Those "devino" numbers are used with "device numbers" to create
system interrupt numbers, and I'd point the virtual IRQ at that.

> I think Eric's framework would work for you. As long as you don't need
> to do something special for MSI-X, which I don't think you do...

That's my current understanding as well.

>  - Try to force our stuff in by implementing x86 completely (and Altix)
> under Michael's infrastructure and then try to convince
> Andrew/Greg/Linus to take it. Fairly unlikely. We do have a somewhat
> "gradual" approach to it which consist of having Michael's code at the
> toplevel, Eric's code hooked in as if it was a hypervisor, and then
> gradually "merge" the raw backend with the x86 code, but it doesn't seem
> very sexy (to me neither).

Well unless you have a working alternative for x86/ia64/etc folk you
have no alternative to Eric's patches to offer for consideration.

I think in the future we'll see more stuff like RTAS, it's the only
way outside of hardware filtering in the PCI-E bridges to provide
real isolation between PCI devices that get divided into different
logical domains.  And full isolation is absolutely required for
proper virtualization.

I think Eric really needs to consider the problem of logical domains,
and what the problem is which the RTAS folks are trying to resolve.
You can't just say something sucks without providing a reasonable
alternative suggestion.

Eric isn't responding to any of my emails on this matter, and that is
not helping at all.  If he would, on the other hand, make constructive
suggestions of how to implement isolation between independent PCI
devices on the same PCI bus which belong to different logical domains,
accounting for MSI, we could actually have a real conversation.

You can't implement isolation unless you 1) strictly control what
devices can do to other devices on the PCI domain or 2) filter
transactions in the PCI bridges so that PCI devices cannot send
arbitrary junk to each other.

#2 is prohibitively expensive and complicated because it requires
specialized hardware.  #1 is low cost in that all you need to do is
make PCI config space accesses and MSI setup go through the
hypervisor.  That's why systems implement #1 to give full isolation.

That's why I think the whole MSI hypervisor thing done by RTAS is
absolutely reasonable and something we should support.  It's NOT like
TCP Offload Engines and the like, not at all, and it's quite upsetting
to see Eric characterize it in that way.  It's a protection and
isolation facility, not a way to hide hardware behind binary blobs.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  4:19                               ` David Miller
@ 2007-01-29  4:44                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  4:44 UTC (permalink / raw)
  To: David Miller
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm


> You can't implement isolation unless you 1) strictly control what
> devices can do to other devices on the PCI domain or 2) filter
> transactions in the PCI bridges so that PCI devices cannot send
> arbitrary junk to each other.
> 
> #2 is prohibitively expensive and complicated because it requires
> specialized hardware.  #1 is low cost in that all you need to do is
> make PCI config space accesses and MSI setup go through the
> hypervisor.  That's why systems implement #1 to give full isolation.

Well, on IBM machines, they do #2 :-) Or rather, they have a P2P bridge
for every slot pretty much ... but then, those are expensive
machines :-)

Now, #1 is an acceptable solution up to a certain point... MSIs being
just normal upstream stores, while you can try to prevent the OS from
programming the config space for a device to use somebody else's MSI
target address or values, you can't prevent the OS from programming some
random device's DMA engine to issue those same bogus cycles, unless you
have an iommu per slot, which is what IBM has :-)

So yes, #1 is probably the "easy" solution but is definitely not 100%
robust.

> That's why I think the whole MSI hypervisor thing done by RTAS is
> absolutely reasonable and something we should support.  It's NOT like
> TCP Offload Engines and the like, not at all, and it's quite upsetting
> to see Eric characterize it in that way.  It's a protection and
> isolation facility, not a way to hide hardware behind binary blobs.

I tend myself to go for a simpler reason which is that unlike TOE which
is a supposed performance improvement on already working hardware with a
huge impact on the linux TCP stack, MSIs "a-la-rtas" are a fairly low
impact on an overall small piece of code to support, and not doing so
means basically totally unusable machines as soon as MSI-only PCIe
stuff shows up (and it does exist already).

Ben. 

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:37               ` David Miller
@ 2007-01-29  5:18                 ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  5:18 UTC (permalink / raw)
  To: David Miller
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 16:26:44 -0700
>
>> Yes.  In general the mainline linux kernel does not support certain
>> classes of stupidity.  TCP offload engines, firmware drivers for
>> hardware we care about, a fixed ABI to binary only modules, etc.
>> It is the responsibility of the OS to setup MSI so we do it, not
>> the firmware so we do it.
>
> I absolutely disagree with you Eric, and I think you're being
> ridiculous.
>
> If the hypervisor doesn't control the MSI PCI config space
> register writes, this allows the device to spam PCI devices
> which belong to other domains.
>
> It's a freakin' reasonable design trade off decision, get over
> it! :-)

I completely agree with you in the case you have described.  It does
mean that the hypervisor needs to trust all of the MSI capable
hardware in the system, but if that is the best your hardware can
support it is a reasonable trade-off.

With the MSI-X registers in a random part of some memory mapped bar
and not guaranteed to be page aligned, things are more difficult to
isolate purely in a software based hypervisor.

> Yes it can be done at the hardware level, and many hypervisor
> based systems do that, but it's not the one-and-only true
> way to implement inter-domain protection behind a single
> PCI host controller.

The reason I consider the case crazy is that every example I have
been given is where the hardware is doing the filtering above the
PCI device.  So the hypervisor has no need to filter the pci config
traffic or to write to the msi config registers for us.  Yet the
defined hypervisor interface does exactly that.  Given the reduction in
flexibility of an interface where the hypervisor writes to the config
registers for the OS as compared to an interface where the hypervisor
provides a destination for MSI messages from a particular device upon
request, I think it is silly to design an interface when you have full
hardware support to act like an interface built for a hypervisor that
had to do everything in software.

Regardless of my opinion on the sanity of the hypervisor architects.
I have not seen anything that indicates it will be hard to support
the hypervisor doing everything or most of everything for us, so
I see no valid technical objection to it.  Nor have I ever.

So I have no problem with additional patches in that direction.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:18                 ` Eric W. Biederman
@ 2007-01-29  5:25                   ` David Miller
  -1 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-01-29  5:25 UTC (permalink / raw)
  To: ebiederm
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 22:18:59 -0700

> Regardless of my opinion on the sanity of the hypervisor architects.
> I have not seen anything that indicates it will be hard to support
> the hypervisor doing everything or most of everything for us, so
> I see no valid technical objection to it.  Nor have I ever.
> 
> So I have no problem with additional patches in that direction.

Ok, that's great to hear.

I know your bi-directional approach isn't exactly what Ben
wants but he can support his machines with it.  Maybe after
some time we can agree to move from that more towards the
totally abstracted scheme.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  1:13                           ` David Miller
  2007-01-29  3:17                             ` Benjamin Herrenschmidt
@ 2007-01-29  5:46                             ` Eric W. Biederman
  2007-01-29  6:08                               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  5:46 UTC (permalink / raw)
  To: David Miller
  Cc: kyle, linuxppc-dev, ebiederm, greg, shaohua.li, linux-pci, brice

David Miller <davem@davemloft.net> writes:

> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Mon, 29 Jan 2007 11:58:21 +1100
>
>> 
>> > There are specific calls into the sparc64 hypervisor for MSI vs. MSI-X
>> > configuration operations.  So a type is necessary.
>> 
>> BTW. Do you have some pointers to documentation on those sparc64
>> interfaces ? I'd like to have a look as we might still try to change
>> some of our approach to match some of Eric's wishes, I want to make
>> sure I'm not going somewhere that will not work for sparc...
>> 
>> For example, I'd like to know if sparc64 HV is indeed like IBM, that is
>> a single HV call does the complete setup, or if you still have some
>> level of manual config space access to do.
>
> I just started reading those docs right now in fact :-)
>
> The sparc64 hypervisor manual is at:
>
> 	http://opensparc-t2.sunsource.net/index.html
>
> Click on "UltraSPARC T1 Hypervisor API Specification" near
> the bottom of the page.  The MSI bits are in section 21.4 on
> page 105.
>
> BTW, I like how Benjamin is being constructive by expressing
> interest in how sparc64's hypervisor works instead of Eric's
> seeming non-interest in how or why RTAS or sparc64 work the
> way that they do :-)

My problem is that I have been asking about RTAS for six months
since before OLS.  Slowly the information has trickled in.  My first
impression is boy is that weird.  My second impression after getting
the full details was huh?  That is ridiculous, simply because they
don't need to do a 

You wound up posting this sparc link before I could ask about it.
Sorry for taking a couple of hours to respond but I'm not always in
front of my computer, and to some extent responding to everything
was becoming counterproductive.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:25                   ` David Miller
@ 2007-01-29  5:58                     ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  5:58 UTC (permalink / raw)
  To: David Miller
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 22:18:59 -0700
>
>> Regardless of my opinion on the sanity of the hypervisor architects.
>> I have not seen anything that indicates it will be hard to support
>> the hypervisor doing everything or most of everything for us, so
>> I see no valid technical objection to it.  Nor have I ever.
>> 
>> So I have no problem with additional patches in that direction.
>
> Ok, that's great to hear.
>
> I know your bi-directional approach isn't exactly what Ben
> wants but he can support his machines with it.  Maybe after
> some time we can agree to move from that more towards the
> totally abstracted scheme.

Moving farther has been my intention the entire time, even
while writing those patches.  I'm just not prepared to do it in
one giant patch where bug hunting becomes impossible.

I think I have moved msi.c to the point it won't be a horror to
work with, so we can start seriously looking at what it will
take to support hypervisors that do this.

I don't believe there is anything generic we can do in the general
hypervisor case, so we need a way for the architecture code in
the case where it is inappropriate to write directly to the msi
and msi-x registers to have a completely different implementation of:
pci_enable_msi, pci_disable_msi, pci_enable_msix, pci_disable_msix,
and whatever other driver interface bits we have in there.

One small step at a time and we should get there soon.

Eric


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:25                   ` David Miller
@ 2007-01-29  6:05                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  6:05 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, jeff, greg, tony.luck, grundler, mingo, linux-kernel,
	kyle, linuxppc-dev, brice, shaohua.li, linux-pci

On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 22:18:59 -0700
> 
> > Regardless of my opinion on the sanity of the hypervisor architects.
> > I have not seen anything that indicates it will be hard to support
> > the hypervisor doing everything or most of everything for us, so
> > I see no valid technical objection to it.  Nor have I ever.
> > 
> > So I have no problem with additional patches in that direction.
> 
> Ok, that's great to hear.
> 
> I know your bi-directional approach isn't exactly what Ben
> wants but he can support his machines with it.  Maybe after
> some time we can agree to move from that more towards the
> totally abstracted scheme.

It can support my machines without HV with trivial changes I reckon: I
need an ops struct to indirect Eric's 2 remaining arch hooks
(setup/teardown) but that can be done inline within asm-powerpc. I need
to double check of course and probably actually port the MPIC backend
and possibly go write the Cell Axon one while at it to verify everything
is all right, but the base design seems sound enough.

For the ones with HV (RTAS stuff), we still need to agree on how to
approach it. We can either:

Option 1
--------

Do a hook -above- Eric's stuff, by having the toplevel APIs themselves be
arch hooks that can either go toward the RTAS implementation or toward
Eric's code. That is, Eric's code would define those (pick better names if
you are good at it):

	pci_generic_enable_msi
	pci_generic_disable_msi
	pci_generic_enable_msix
	pci_generic_disable_msix
	pci_generic_save_msi_state
	pci_generic_restore_msi_state

Then we can have asm-i386/msi.h & friends do something like

#define pci_enable_msi	pci_generic_enable_msi
#define pci_disable_msi	pci_generic_disable_msi
   etc...

And we can have asm-powerpc/msi.h hook them via ppc_md:

static inline int pci_enable_msi(xxx...)
{
	return ppc_md.pci_enable_msi(xxx...);
}
etc...

(ppc_md is our per-platform global hook structure, filled at boot when we
discover what machine type we are running on) so that pSeries can use
its own RTAS callbacks, and others can just re-hook those to Eric's
code.
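
A minimal userspace sketch of Option 1, assuming a cut-down ppc_md with
a single pci_enable_msi member. The pci_enable_msi hook, the stand-in
struct pci_dev and the rtas_enable_msi backend are hypothetical; only
the pci_generic_enable_msi name is taken from the list above:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for struct pci_dev; msi_backend records who handled it. */
struct pci_dev {
	int msi_enabled;
	const char *msi_backend;
};

/* What the generic (config-space writing) code would export. */
static int pci_generic_enable_msi(struct pci_dev *dev)
{
	dev->msi_enabled = 1;
	dev->msi_backend = "generic";
	return 0;
}

/* Hypothetical pSeries backend that would call into RTAS instead. */
static int rtas_enable_msi(struct pci_dev *dev)
{
	dev->msi_enabled = 1;
	dev->msi_backend = "rtas";
	return 0;
}

/* Cut-down per-platform hook structure, filled in at boot. */
struct machdep_calls {
	int (*pci_enable_msi)(struct pci_dev *dev);
};
static struct machdep_calls ppc_md;

/* What asm-powerpc/msi.h would provide under Option 1. */
static inline int pci_enable_msi(struct pci_dev *dev)
{
	assert(ppc_md.pci_enable_msi != NULL);	/* platform must set it */
	return ppc_md.pci_enable_msi(dev);
}
```

A pSeries setup routine would point ppc_md.pci_enable_msi at
rtas_enable_msi, while platforms without HV would point it at
pci_generic_enable_msi; drivers keep calling pci_enable_msi() either way.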


Option 2
--------

That is to make Eric's code itself cope with the HV case. I'm a bit at
a loss right now as to how precisely to do it. I need to spend more time
staring at the code after Eric's latest patches rather than the patches
themselves I suppose :-) (Eric, they don't apply out of the box on
current git, are they against -mm ?).

Some of the main issues here, more or less following the order in which
Eric's code calls things:

 - The number of vectors for MSI-X is obtained from config space (at
least for sanity checking the requested argument). On RTAS, it should
come from an OF property (we are really not supposed to go read the
config space even if we can). I -suppose- we can survive for now with
just reading it, but we might well run into trouble with some "special"
devices shared across partitions or if the IBM magic bridges themselves
ever start sending MSI-X on their own (unlikely but who knows...).
Michael's code handled that by having a callback ->check() do the sanity
checking of the nvec, and then just use the nvec passed in as an
argument once it's sane.

So for that I would propose adding an arch_check_msi(pdev, type, nvec)
or something like that. Not the biggest issue at this point anyway.

 - The real big one: For MSI-X, Eric's code tries to "hide" the fact
that those are MSI-X by allocating the msi-x entry array, then iterating
through them calling arch_setup_msi_irq() for each of them.

For that to work for us, it would need to be different, possibly
pre-allocating the array, and having -one- call taking an array and a
nvec. That's one of the reasons why I liked Michael's approach as
instead of making MSI-X look like MSI, it made MSI look like MSI-X by
passing a 1 entry array in the MSI case. Both approaches can probably be
made to handle multiple MSIs if we ever want to handle them.

The same issue is present for teardown of course.

 - We need HV hooks for suspend/resume at one point. Nothing urgent
there as our HV machines don't do suspend/resume just yet :-) But if we
ever implement something like suspend-to-disk, they'll definitely need
something as we are likely to get different vectors back from the
firmware so we need to re-map them to the same linux IRQ numbers.
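
A minimal userspace sketch of the first two points, assuming a
hypothetical ops structure with a check() hook and a single setup()
call that takes an array and nvec, with plain MSI expressed as the
one-entry case. None of these names or signatures come from the posted
patches; max_vectors stands in for whatever the platform reports (an OF
property on RTAS, config space elsewhere):

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum msi_type { TYPE_MSI, TYPE_MSIX };		/* hypothetical */

struct msi_vec { int entry; int irq; };		/* hypothetical */

struct pci_dev { int max_vectors; };		/* stand-in */

struct msi_ops {
	int (*check)(struct pci_dev *dev, enum msi_type type, int nvec);
	int (*setup)(struct pci_dev *dev, struct msi_vec *vecs, int nvec);
};

/* Sanity check nvec against what the platform says is available. */
static int demo_check(struct pci_dev *dev, enum msi_type type, int nvec)
{
	if (nvec < 1 || (type == TYPE_MSI && nvec != 1))
		return -EINVAL;
	if (nvec > dev->max_vectors)
		return -ENOSPC;
	return 0;
}

/* One call sets up every vector; an HV backend could do it in one go. */
static int demo_setup(struct pci_dev *dev, struct msi_vec *vecs, int nvec)
{
	(void)dev;
	for (int i = 0; i < nvec; i++) {
		vecs[i].entry = i;
		vecs[i].irq = 100 + i;	/* made-up linux irq numbers */
	}
	return 0;
}

static int enable_vectors(const struct msi_ops *ops, struct pci_dev *dev,
			  enum msi_type type, struct msi_vec *vecs, int nvec)
{
	int rc;

	assert(vecs != NULL);
	rc = ops->check(dev, type, nvec);
	return rc ? rc : ops->setup(dev, vecs, nvec);
}

/* Plain MSI made to look like MSI-X with a one-entry array. */
static int enable_msi(const struct msi_ops *ops, struct pci_dev *dev,
		      int *irq)
{
	struct msi_vec one = { 0, 0 };
	int rc = enable_vectors(ops, dev, TYPE_MSI, &one, 1);

	if (rc == 0)
		*irq = one.irq;
	return rc;
}
```

Teardown would presumably mirror setup() with the same array-plus-nvec
shape.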

I need to have a second look at Eric's code after I manage to find the
right combination of kernel for his patches to apply to check if I
missed anything important.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  5:46                             ` Eric W. Biederman
@ 2007-01-29  6:08                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  6:08 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, David Miller


> My problem is that I have been asking about RTAS for six months
> since before OLS.  Slowly the information has trickled in.  My first
> impression is boy is that weird.  My second impression after getting
> the full details was huh?  That is ridiculous, simply because they
> don't need to do a 

Michael's been posting early versions of his work ages ago, as Jake did
with some of his earlier stuff based on hooking at the toplevel, and I'm
pretty sure that at least for Michael's stuff, you've always been CCed.

Anyway, doesn't matter now. In my latest reply to David, I've basically
summarized what I think are our 2 options to move forward based on your
patches, I would appreciate your input on that.

Ben.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-28  8:27   ` [RFC/PATCH 4/16] Abstract MSI suspend Eric W. Biederman
@ 2007-01-29  7:22     ` Michael Ellerman
  2007-01-29  8:45       ` Eric W. Biederman
  2007-02-01  4:24       ` Greg KH
  0 siblings, 2 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-29  7:22 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, EricW.Biederman

[-- Attachment #1: Type: text/plain, Size: 1907 bytes --]

On Sun, 2007-01-28 at 01:27 -0700, Eric W. Biederman wrote:
> Michael Ellerman <michael@ellerman.id.au> writes:
> 
> > Currently pci_disable_device() disables MSI on a device by twiddling
> > bits in config space via disable_msi_mode().
> >
> > On some platforms that may not be appropriate, so abstract the MSI
> > suspend logic into pci_disable_device_msi().
> 
> >
> > Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> > ---
> >
> >  drivers/pci/msi.c |   11 +++++++++++
> >  drivers/pci/pci.c |    7 +------
> >  drivers/pci/pci.h |    2 ++
> >  3 files changed, 14 insertions(+), 6 deletions(-)
> >
> > Index: msi/drivers/pci/msi.c
> > ===================================================================
> > --- msi.orig/drivers/pci/msi.c
> > +++ msi/drivers/pci/msi.c
> > @@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
> >  	pci_intx(dev, 1);  /* enable intx */
> >  }
> >  
> > +void pci_disable_device_msi(struct pci_dev *dev)
> > +{
> > +	if (dev->msi_enabled)
> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> > +			PCI_CAP_ID_MSI);
> > +
> > +	if (dev->msix_enabled)
> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> > +			PCI_CAP_ID_MSIX);
> 
> Just a quick note. This is wrong.  It should be PCI_CAP_ID_MSIX.
> The code that is being moved is buggy.  So the patch itself doesn't
> make the situation any worse.

Greg, if you want to drop that patch I'll prepare two patches to fix it
and then move it. I don't have any hardware to test, although I'm
guessing no one does given that it's been broken since its inception.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  6:05                     ` Benjamin Herrenschmidt
@ 2007-01-29  8:28                       ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  8:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, ebiederm, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci

>
> That is to make Eric's code itself cope with the HV case. I'm a bit at
> loss right now as how precisely to do it. I need to spend more time
> staring at the code after Eric latest patches rather than the patches
> themselves I suppose :-) (Eric, they don't apply out of the box on
> current git, they are against -mm ?).

Current git + gregkh-pci (Which has a couple of Michael's patches).
With current git the only problem should be context around msi_lookup_irq
which changes between the two.  But in this case the context around
an entire function being deleted doesn't matter.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29  7:22     ` Michael Ellerman
@ 2007-01-29  8:45       ` Eric W. Biederman
  2007-01-29  9:47         ` Michael Ellerman
  2007-02-01  4:24       ` Greg KH
  1 sibling, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  8:45 UTC (permalink / raw)
  To: michael
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> On Sun, 2007-01-28 at 01:27 -0700, Eric W. Biederman wrote:
>> Michael Ellerman <michael@ellerman.id.au> writes:
>> 
>> > Currently pci_disable_device() disables MSI on a device by twiddling
>> > bits in config space via disable_msi_mode().
>> >
>> > On some platforms that may not be appropriate, so abstract the MSI
>> > suspend logic into pci_disable_device_msi().
>> 
>> >
>> > Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
>> > ---
>> >
>> >  drivers/pci/msi.c |   11 +++++++++++
>> >  drivers/pci/pci.c |    7 +------
>> >  drivers/pci/pci.h |    2 ++
>> >  3 files changed, 14 insertions(+), 6 deletions(-)
>> >
>> > Index: msi/drivers/pci/msi.c
>> > ===================================================================
>> > --- msi.orig/drivers/pci/msi.c
>> > +++ msi/drivers/pci/msi.c
>> > @@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
>> >  	pci_intx(dev, 1);  /* enable intx */
>> >  }
>> >  
>> > +void pci_disable_device_msi(struct pci_dev *dev)
>> > +{
>> > +	if (dev->msi_enabled)
>> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
>> > +			PCI_CAP_ID_MSI);
>> > +
>> > +	if (dev->msix_enabled)
>> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
>> > +			PCI_CAP_ID_MSIX);
>> 
>> Just a quick note. This is wrong.  It should be PCI_CAP_ID_MSIX.
>> The code that is being moved is buggy.  So the patch itself doesn't
>> make the situation any worse.
>
> Greg, if you want to drop that patch I'll prepare two patches to fix it
> and then move it. I don't have any hardware to test, although I'm
> guessing no one does given that it's been broken since its inception.

The mthca IB driver was one of the early adopters of MSI, and it uses
MSI-X.  So it isn't that no one is using MSI-X and the MSI-X code
paths don't get exercised.

I expect what is closer to the truth is that the code authors have so
far simply disabled msi separately instead of assuming pci_disable_device
will do it magically for them.  So it may be the case that we can
just kill this code path altogether.

Possibly we can just reduce it to WARN_ON(dev->msi_enabled || dev->msix_enabled);
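
A minimal userspace sketch of the two ideas above: first the moved
function with the MSI-X capability lookup corrected, then the reduced
warn-only variant. The struct layout, the stubbed pci_find_capability()
and disable_msi_mode(), the *_pos fields and the name
pci_disable_device_msi_check() are stand-ins invented for illustration,
not the kernel definitions.

```c
#include <assert.h>
#include <stdio.h>

#define PCI_CAP_ID_MSI  0x05
#define PCI_CAP_ID_MSIX 0x11

/* Stand-in for struct pci_dev: only the fields this sketch needs. */
struct pci_dev {
	unsigned int msi_enabled:1;
	unsigned int msix_enabled:1;
	int msi_pos;   /* hypothetical: config-space offset of the MSI capability */
	int msix_pos;  /* hypothetical: config-space offset of the MSI-X capability */
	int last_pos;  /* records the offset handed to disable_msi_mode() */
};

/* Stub: the real function walks the capability list in config space. */
static int pci_find_capability(struct pci_dev *dev, int cap)
{
	return cap == PCI_CAP_ID_MSIX ? dev->msix_pos : dev->msi_pos;
}

/* Stub: the real function clears the enable bit found at 'pos'. */
static void disable_msi_mode(struct pci_dev *dev, int pos, int type)
{
	dev->last_pos = pos;
	if (type == PCI_CAP_ID_MSIX)
		dev->msix_enabled = 0;
	else
		dev->msi_enabled = 0;
}

/* Corrected variant: look up the MSI-X capability for the MSI-X case. */
static void pci_disable_device_msi(struct pci_dev *dev)
{
	if (dev->msi_enabled)
		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
			PCI_CAP_ID_MSI);

	if (dev->msix_enabled)
		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSIX),
			PCI_CAP_ID_MSIX);
}

/* Reduced variant: only complain if a driver left MSI/MSI-X switched on. */
static int pci_disable_device_msi_check(struct pci_dev *dev)
{
	int still_on = dev->msi_enabled || dev->msix_enabled;

	if (still_on)
		fprintf(stderr, "WARNING: MSI/MSI-X still enabled at disable\n");
	return still_on;
}
```

The warn-only variant assumes every driver already calls
pci_disable_msi()/pci_disable_msix() itself, which would need checking
before the disabling path is dropped.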

I suspect msi_remove_pci_irq_vectors may similarly not actually make a
difference.   I think the function dates from a time when the code
attempted to cache the irq number so if you removed and re-added a module
or at least disabled and enabled msi you would get the same linux irq
number.  I remember killing that caching because it made the code
unmaintainable and wasn't really useful.

In summary I think there is still room for cleanup in msi.c, but I
think it is at least reaching the point where you don't just stumble
over opportunities.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  6:05                     ` Benjamin Herrenschmidt
@ 2007-01-29  9:03                       ` Eric W. Biederman
  -1 siblings, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29  9:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
>> From: ebiederm@xmission.com (Eric W. Biederman)
>> Date: Sun, 28 Jan 2007 22:18:59 -0700
>> 
>> > Regardless of my opinion on the sanity of the hypervisor architects.
>> > I have not seen anything that indicates it will be hard to support
>> > the hypervisor doing everything or most of everything for us, so
>> > I see no valid technical objection to it.  Nor have I ever.
>> > 
>> > So I have no problem with additional patches in that direction.
>> 
>> Ok, that's great to hear.
>> 
>> I know your bi-directional approach isn't exactly what Ben
>> wants but he can support his machines with it.  Maybe after
>> some time we can agree to move from that more towards the
>> totally abstracted scheme.
>
> It can support my machines without HV with trivial changes I reckon: I
> need an ops struct to indirect eric's 2 remaining arch hooks
> (setup/teardown) but that can be done inline within asm-powerpc. I need
> to double check of course and probably actually port the MPIC backend
> and possibly go write the Cell Axon one while at it to verify everything
> is all right, but the base design seems sound enough.
>
> For the ones with HV (RTAS stuff), we still need to agree on how to
> approach it. We can either:
>
> Option 1
> --------
>
> Do a hook -above- Eric stuff, by having the toplevel APIs themselves be
> arch hooks that can either go toward the RTAS implementation or toward
> Eric's code. That is, eric code would define those (pick better names if
> you are good at it):
>
> 	pci_generic_enable_msi
> 	pci_generic_disable_msi
> 	pci_generic_enable_msix
> 	pci_generic_disable_msix
> 	pci_generic_save_msi_state
> 	pci_generic_restore_msi_state
>
> Then we can have asm-i386/msi.h & friends do something like
>
> #define pci_enable_msi	pci_generic_enable_msi
> #define pci_disable_msi	pci_generic_disable_msi
>    etc...
>
> And we can have asm-powerpc/msi.h hook then via ppc_md:
>
> static inline int pci_enable_msi(xxx...)
> {
> 	return ppc_md.pci_enable_msi(xxx...);
> }
> etc...
>
> (ppc_md is our per-platform global hook structure filled at boot when we
> discover on what machine type we are running on) so that pSeries can use
> its own RTAS callbacks, and others can just re-hook those to Eric's
> code.

This is the most straightforward and handles machines with really
weird msi setups, so I lean in this direction.

The question is: is there anything at all we can do generically?

I can't see a case where ppc_md would not wind up with the hooks
that decide if it is a hypervisor or not.  Even if we came up
with a better set of functions you need to hook.
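
A rough userspace sketch of the Option 1 dispatch, to make the shape
concrete. The cut-down machdep_calls, the one-field pci_dev, the
rtas_enable_msi() backend and the error codes are all assumptions for
illustration; only the ppc_md.pci_enable_msi indirection and the
pci_generic_enable_msi name come from the proposal quoted above.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for struct pci_dev: just the flag the generic code would own. */
struct pci_dev {
	int msi_enabled;
};

/* Stand-in for the generic (config-space based) implementation. */
static int pci_generic_enable_msi(struct pci_dev *dev)
{
	dev->msi_enabled = 1;
	return 0;
}

/* Hypothetical RTAS backend: the hypervisor may refuse MSI for a device. */
static int rtas_enable_msi(struct pci_dev *dev)
{
	(void)dev;
	return -ENODEV;	/* pretend firmware said no */
}

/* Cut-down machdep_calls holding only the hook this sketch needs. */
struct machdep_calls {
	int (*pci_enable_msi)(struct pci_dev *dev);
};

/* Filled in at boot once the platform type is known. */
static struct machdep_calls ppc_md;

/* What asm-powerpc/msi.h might provide under Option 1. */
static inline int pci_enable_msi(struct pci_dev *dev)
{
	if (!ppc_md.pci_enable_msi)
		return -ENOSYS;	/* platform registered no MSI support */
	return ppc_md.pci_enable_msi(dev);
}
```

A pSeries setup routine would point the hook at its RTAS callback, and
a non-HV platform would point it straight at pci_generic_enable_msi().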

> Option 2
> --------
>
> That is to make Eric's code itself cope with the HV case. I'm a bit at
> loss right now as how precisely to do it. I need to spend more time
> staring at the code after Eric latest patches rather than the patches
> themselves I suppose :-) (Eric, they don't apply out of the box on
> current git, they are against -mm ?).
>
> Some of the main issues here, more/less following the order in which
> Eric code calls things:
>
>  - The number of vectors for MSI-X is obtained from config space (at
> least for sanity checking the requested argument). On RTAS, it should
> come from an OF property (we are really not supposed to go read the
> config space even if we can). I -suppose- we can survive for now with
> just reading it, but we might well run into trouble with some "special"
> devices shared across partitions or if the IBM magic bridges themselves
> ever start sending MSI-X on their own (unlikely but who knows...).
> Michael's code handled that by having a callback ->check() do the sanity
> checking of the nvec, and then just use the nvec passed in as an
> argument once it's sane.

Ok. I think I get the point of check.  I believe I need to look at your
code a little more and see what you are doing to see if there is anything
generic worth doing, that we can always do outside of architecture code
no matter how much of the job the Hypervisor wants to do for us.

I'd hate to hit a different Hypervisor that did something close but
not quite the same and have the code fail then.  So definitely
avoiding touching pci config space at all in the calls seems to make a
lot of sense.  This includes avoiding pci_find_capability right?

Off the top of my head the only things we can do generically are
some data structure things and flags like dev->msi_enabled or
dev->msix_enabled.
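
As a sketch of how little the generic side might then contain: the core
below never touches config space, it only orders the calls and owns the
enabled flag, while the backend's ->check() validates nvec against
whatever the firmware reported. The msi_ops layout, the max_vectors
field and the hv_* functions are hypothetical names for illustration
only.

```c
#include <assert.h>
#include <errno.h>

struct pci_dev;

/* Hypothetical backend ops; ->check() mirrors the callback described above. */
struct msi_ops {
	int (*check)(struct pci_dev *dev, int nvec);  /* 0 if nvec is acceptable */
	int (*setup)(struct pci_dev *dev, int nvec);  /* allocate and program vectors */
};

/* Stand-in for struct pci_dev. */
struct pci_dev {
	int msix_enabled;
	int max_vectors;           /* hypothetical: limit reported by firmware/HV */
	const struct msi_ops *ops;
};

/* Backend check that consults firmware-provided data, not config space. */
static int hv_check(struct pci_dev *dev, int nvec)
{
	return (nvec > 0 && nvec <= dev->max_vectors) ? 0 : -EINVAL;
}

/* Backend setup: a real one would make the hypervisor call here. */
static int hv_setup(struct pci_dev *dev, int nvec)
{
	(void)dev;
	(void)nvec;
	return 0;	/* pretend the hypervisor call succeeded */
}

static const struct msi_ops hv_ops = { .check = hv_check, .setup = hv_setup };

/* Generic part: no config-space access, only ordering and flag handling. */
static int generic_enable_msix(struct pci_dev *dev, int nvec)
{
	int rc;

	if (dev->msix_enabled)
		return -EBUSY;

	rc = dev->ops->check(dev, nvec);
	if (rc)
		return rc;

	rc = dev->ops->setup(dev, nvec);
	if (rc)
		return rc;

	dev->msix_enabled = 1;
	return 0;
}
```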

Anyway have a nice night and more in the morning.


Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29  8:45       ` Eric W. Biederman
@ 2007-01-29  9:47         ` Michael Ellerman
  2007-01-29 16:52           ` Grant Grundler
  2007-01-29 17:20           ` Eric W. Biederman
  0 siblings, 2 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-29  9:47 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

[-- Attachment #1: Type: text/plain, Size: 3893 bytes --]

On Mon, 2007-01-29 at 01:45 -0700, Eric W. Biederman wrote:
> Michael Ellerman <michael@ellerman.id.au> writes:
> 
> > On Sun, 2007-01-28 at 01:27 -0700, Eric W. Biederman wrote:
> >> Michael Ellerman <michael@ellerman.id.au> writes:
> >> 
> >> > Currently pci_disable_device() disables MSI on a device by twiddling
> >> > bits in config space via disable_msi_mode().
> >> >
> >> > On some platforms that may not be appropriate, so abstract the MSI
> >> > suspend logic into pci_disable_device_msi().
> >> 
> >> >
> >> > Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> >> > ---
> >> >
> >> >  drivers/pci/msi.c |   11 +++++++++++
> >> >  drivers/pci/pci.c |    7 +------
> >> >  drivers/pci/pci.h |    2 ++
> >> >  3 files changed, 14 insertions(+), 6 deletions(-)
> >> >
> >> > Index: msi/drivers/pci/msi.c
> >> > ===================================================================
> >> > --- msi.orig/drivers/pci/msi.c
> >> > +++ msi/drivers/pci/msi.c
> >> > @@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
> >> >  	pci_intx(dev, 1);  /* enable intx */
> >> >  }
> >> >  
> >> > +void pci_disable_device_msi(struct pci_dev *dev)
> >> > +{
> >> > +	if (dev->msi_enabled)
> >> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> >> > +			PCI_CAP_ID_MSI);
> >> > +
> >> > +	if (dev->msix_enabled)
> >> > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> >> > +			PCI_CAP_ID_MSIX);
> >> 
> >> Just a quick note. This is wrong.  It should be PCI_CAP_ID_MSIX.
> >> The code that is being moved is buggy.  So the patch itself doesn't
> >> make the situation any worse.
> >
> > Greg, if you want to drop that patch I'll prepare two patches to fix it
> > and then move it. I don't have any hardware to test, although I'm
> > guessing no one does given that it's been broken since its inception.
> 
> The mthca IB driver was one of the early adopters of MSI, and it uses
> MSI-X.  So it isn't that no one is using MSI-X and the MSI-X code
> paths don't get exercised.

I meant the MSI-X suspend/resume path specifically; I'm guessing most
laptops don't come with IB cards yet ;)

> I expect what is closer to the truth is that the code authors have so
> far simply disabled msi separately instead of assuming pci_disable_device
> will do it magically for them.  So it may be the case that we can
> just kill this code path altogether.

I recall reading comments to that effect in one driver, although it
wasn't obvious exactly what the problem was - but it's probably worth
doing a thorough review while the number of MSI/MSI-X drivers is small.

> Possibly we can just reduce it to WARN_ON(dev->msi_enabled || dev->msix_enabled);
> 
> I suspect msi_remove_pci_irq_vectors may similarly not actually make a
> difference.   I think the function dates from a time when the code
> attempted to cache the irq number so if you removed and re-added a module
> or at least disabled and enabled msi you would get the same linux irq
> number.  I remember killing that caching because it made the code
> unmaintainable and wasn't really useful.

That was my suspicion as well, I was hoping someone who knew the code
better than me would pipe up and let me know why it was a special case.
Have the original MSI authors vanished without a trace?

It seems to date from the initial MSI submission, and has only ever been
called from pci_free_resources(). The rest of the pci hotunplug code
paths are not clear to me though, so I don't know whether we can rely on
pci_disable_msi() already being called for us.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  9:03                       ` Eric W. Biederman
@ 2007-01-29 10:11                         ` Michael Ellerman
  -1 siblings, 0 replies; 178+ messages in thread
From: Michael Ellerman @ 2007-01-29 10:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Benjamin Herrenschmidt, David Miller, jeff, greg, tony.luck,
	grundler, mingo, linux-kernel, kyle, linuxppc-dev, brice,
	shaohua.li, linux-pci

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

On Mon, 2007-01-29 at 02:03 -0700, Eric W. Biederman wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> 
> > On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
> >> From: ebiederm@xmission.com (Eric W. Biederman)
> >> Date: Sun, 28 Jan 2007 22:18:59 -0700
> >> 
> >> > Regardless of my opinion on the sanity of the hypervisor architects.
> >> > I have not seen anything that indicates it will be hard to support
> >> > the hypervisor doing everything or most of everything for us, so
> >> > I see no valid technical objection to it.  Nor have I ever.
> >> > 
> >> > So I have no problem with additional patches in that direction.
> >> 
> >> Ok, that's great to hear.
> >> 
> >> I know your bi-directional approach isn't exactly what Ben
> >> wants but he can support his machines with it.  Maybe after
> >> some time we can agree to move from that more towards the
> >> totally abstracted scheme.
> >
> > It can support my machines without HV with trivial changes I reckon: I
> > need an ops struct to indirect eric's 2 remaining arch hooks
> > (setup/teardown) but that can be done inline within asm-powerpc. I need
> > to double check of course and probably actually port the MPIC backend
> > and possibly go write the Cell Axon one while at it to verify everything
> > is all right, but the base design seems sound enough.
> >
> > For the ones with HV (RTAS stuff), we still need to agree on how to
> > approach it. We can either:
> >
> > Option 1
> > --------
> >
> > Do a hook -above- Eric stuff, by having the toplevel APIs themselves be
> > arch hooks that can either go toward the RTAS implementation or toward
> > Eric's code. That is, eric code would define those (pick better names if
> > you are good at it):
> >
> > 	pci_generic_enable_msi
> > 	pci_generic_disable_msi
> > 	pci_generic_enable_msix
> > 	pci_generic_disable_msix
> > 	pci_generic_save_msi_state
> > 	pci_generic_restore_msi_state
> >
> > Then we can have asm-i386/msi.h & friends do something like
> >
> > #define pci_enable_msi	pci_generic_enable_msi
> > #define pci_disable_msi	pci_generic_disable_msi
> >    etc...
> >
> > And we can have asm-powerpc/msi.h hook then via ppc_md:
> >
> > static inline int pci_enable_msi(xxx...)
> > {
> > 	return ppc_md.pci_enable_msi(xxx...);
> > }
> > etc...
> >
> > (ppc_md is our per-platform global hook structure filled at boot when we
> > discover on what machine type we are running on) so that pSeries can use
> > its own RTAS callbacks, and others can just re-hook those to Eric's
> > code.
> 
> This is the most straight forward and handles machines with really
> weird msi setups, so I lean in this direction.
> 
> The question is there anything at all we can do generically?
> 
> I can't see a case where ppc_md would not wind up with the hooks
> that decide if it is a hypervisor or not.  Even if we came up
> with a better set of functions you need to hook.
> 
> > Option 2
> > --------
> >
> > That is to make Eric's code itself cope with the HV case. I'm a bit at
> > loss right now as how precisely to do it. I need to spend more time
> > staring at the code after Eric latest patches rather than the patches
> > themselves I suppose :-) (Eric, they don't apply out of the box on
> > current git, they are against -mm ?).
> >
> > Some of the main issues here, more/less following the order in which
> > Eric code calls things:
> >
> >  - The number of vectors for MSI-X is obtained from config space (at
> > least for sanity checking the requested argument). On RTAS, it should
> > come from an OF property (we are really not supposed to go read the
> > config space even if we can). I -suppose- we can survive for now with
> > just reading it, but we might well run into trouble with some "special"
> > devices shared across partitions or if the IBM magic bridges themselves
> > ever start sending MSI-X on their own (unlikely but who knows...).
> > Michael's code handled that by having a callback ->check() do the sanity
> > checking of the nvec, and then just use the nvec passed in as an
> > argument once it's sane.
> 
> Ok. I think I get the point of check.  I believe I need to look at your
> code a little more and see what you are doing to see if there is anything
> generic worth doing, that we can always do outside of architecture code
> no matter how much of the job the Hypervisor wants to do for us.
> 
> I'd hate to hit a different Hypervisor that did something close but
> not quite the same and have the code fail then.  So definitely
> avoiding touching pci config space at all in the calls seems to make a
> lot of sense.  This includes avoiding pci_find_capability right?

You can read config space, but it's not clear to me if the HV is allowed
to filter it and hide things. It's also possible that the device
supports MSI, but for some reason the HV doesn't allow it on that device
etc., so you really have to ask the HV if it's enabled. So
pci_find_capability() shouldn't crash or anything, but it may lie to you.

> Off the top of my head the only things we can do generically are
> some data structure things and flags like dev->msi_enabled or
> dev->msix_enabled.

It would be good to have a common data structure if possible. My
thinking was that most of the information is per pci_dev, so that's
where I put it. I realise the Intel code stores some info that's
per-irq, but most of it is per-device. I hadn't got anywhere near coding
it, but my vague idea was to add an arch_data (or whatever) pointer to my
msi_info struct, which would allow backends to stash stuff.

I think the pci_intx() calls can be in the core.

Munging dev->irq could be in the core, assuming it's left in some known
location by the code. On the other hand we might want to decide it's a
bad idea altogether.

One thing I did like about my code is that pci_enable_msi() and
pci_enable_msix() are just small wrappers around generic_enable_msi() -
which does all the work, and is the same regardless of whether it's an
MSI or MSI-X. Although that's facilitated by the type arg which you
don't like.
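
Roughly, the wrapper arrangement looks like the userspace sketch below.
It is a simplification under stated assumptions: the msi_info fields
are guesses at what such a struct would hold, the real
pci_enable_msix() also takes an entries array, and the body of
generic_enable_msi() here only records state instead of calling into a
backend.

```c
#include <assert.h>
#include <errno.h>

#define PCI_CAP_ID_MSI  0x05
#define PCI_CAP_ID_MSIX 0x11

/* Per-device MSI bookkeeping; arch_data is private to the backend. */
struct msi_info {
	int type;         /* PCI_CAP_ID_MSI or PCI_CAP_ID_MSIX, 0 when disabled */
	int num;          /* number of vectors in use */
	void *arch_data;  /* backend-owned, never interpreted by the core */
};

/* Stand-in for struct pci_dev. */
struct pci_dev {
	struct msi_info msi;
};

/* One code path for both flavours; 'type' selects MSI vs MSI-X. */
static int generic_enable_msi(struct pci_dev *dev, int nvec, int type)
{
	if (dev->msi.type)
		return -EBUSY;	/* some form of MSI is already enabled */
	if (nvec < 1)
		return -EINVAL;

	dev->msi.type = type;
	dev->msi.num = nvec;
	return 0;
}

/* Thin wrapper: plain MSI is the single-vector case. */
static int pci_enable_msi(struct pci_dev *dev)
{
	return generic_enable_msi(dev, 1, PCI_CAP_ID_MSI);
}

/* Thin wrapper: MSI-X passes the requested vector count through. */
static int pci_enable_msix(struct pci_dev *dev, int nvec)
{
	return generic_enable_msi(dev, nvec, PCI_CAP_ID_MSIX);
}
```

The type argument is what lets the two entry points share one body; a
design without it would need the MSI and MSI-X paths split further down.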

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29  9:47         ` Michael Ellerman
@ 2007-01-29 16:52           ` Grant Grundler
  2007-01-29 16:57             ` Roland Dreier
  2007-01-29 17:20           ` Eric W. Biederman
  1 sibling, 1 reply; 178+ messages in thread
From: Grant Grundler @ 2007-01-29 16:52 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S. Miller,
	Brice Goglin

On Mon, Jan 29, 2007 at 08:47:38PM +1100, Michael Ellerman wrote:
...
> > The mthca IB driver was one of the early adopters of MSI, and it uses
> > MSI-X.  So it isn't that no one is using MSI-X and the MSI-X code
> > paths don't get exercised.
> 
> I meant the MSI-X suspend/resume path specifically, I'm guessing most
> laptops don't come with IB cards yet ;)

Laptops now come with 1000BT chips that _do_ support MSI-X.

grant

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29 16:52           ` Grant Grundler
@ 2007-01-29 16:57             ` Roland Dreier
  2007-01-29 17:02               ` Roland Dreier
  2007-01-29 22:03               ` Grant Grundler
  0 siblings, 2 replies; 178+ messages in thread
From: Roland Dreier @ 2007-01-29 16:57 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S. Miller,
	Brice Goglin

 > laptops now come with 1000BT chips that _do_ support MSI-X.

Really?  Which gigE chips are using MSI-X (as opposed to MSI)?

(I am using MSI with e1000 on my laptop, but I've not seen any NICs
other than 10-gigE NICs that even have an MSI-X capability -- none of
the e1000, tg3 or bnx2 devices I have around have it at least)

 - R.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29 16:57             ` Roland Dreier
@ 2007-01-29 17:02               ` Roland Dreier
  2007-01-29 17:25                 ` Eric W. Biederman
  2007-01-29 22:03               ` Grant Grundler
  1 sibling, 1 reply; 178+ messages in thread
From: Roland Dreier @ 2007-01-29 17:02 UTC (permalink / raw)
  To: Grant Grundler
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S. Miller,
	Brice Goglin

 > Really?  Which gigE chips are using MSI-X (as opposed to MSI)?

OK, I should look before I post.  But anyway a quick grep shows that
the forcedeth driver does enable MSI-X for at least some devices.  And
a quick look at the nv_suspend() function makes me think that suspend
probably won't work if MSI-X is used, since it doesn't save the MSI-X
state anywhere that I can see (unless the device is magic enough to
keep the MSI-X table in some sort of persistent storage, which I
highly doubt).

 - R.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29  9:47         ` Michael Ellerman
  2007-01-29 16:52           ` Grant Grundler
@ 2007-01-29 17:20           ` Eric W. Biederman
  1 sibling, 0 replies; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29 17:20 UTC (permalink / raw)
  To: michael
  Cc: Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Michael Ellerman <michael@ellerman.id.au> writes:

> That was my suspicion as well, I was hoping someone who knew the code
> better than me would pipe up and let me know why it was a special case.
> Have the original MSI authors vanished without a trace?

I have never gotten any usable feedback from that direction anyway.

> It seems to date from the initial MSI submission, and has only ever been
> called from pci_free_resources(). The rest of the pci hotunplug code
> paths are not clear to me though, so I don't know whether we can rely on
> pci_disable_msi() already being called for us.

Good question.  I do know I have always been a little suspicious of
the suspend/resume path, because I haven't read it.

There may be a point in the hot-unplug path where the code actually
makes sense, in a don't-touch-the-hardware kind of way.

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29 17:02               ` Roland Dreier
@ 2007-01-29 17:25                 ` Eric W. Biederman
  2007-01-29 17:32                   ` Roland Dreier
  0 siblings, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-29 17:25 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

Roland Dreier <rdreier@cisco.com> writes:

>  > Really?  Which gigE chips are using MSI-X (as opposed to MSI)?
>
> OK, I should look before I post.  But anyway a quick grep shows that
> the forcedeth driver does enable MSI-X for at least some devices.  And
> a quick look at the nv_suspend() function makes me think that suspend
> probably won't work if MSI-X is used, since it doesn't save the MSI-X
> state anywhere that I can see (unless the device is magic enough to
> keep the MSI-X table in some sort of persistent storage, which I
> highly doubt).

Hmm.

There is this function.  It does save the table, but I haven't
looked closely enough to know whether it saves all of the
other details.

It certainly looks to me like we are at least missing
details like saving the mask bit.

I have always been under the impression that this code was
at least close enough that it could be fixed to do
what we needed.

int pci_save_msix_state(struct pci_dev *dev)
{
	int pos;
	int irq, head, tail = 0;
	u16 control;
	struct pci_cap_saved_state *save_state;

	if (!dev->msix_enabled)
		return 0;

	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
	if (pos <= 0 || dev->no_msi)
		return 0;

	/* save the capability */
	pci_read_config_word(dev, msi_control_reg(pos), &control);
	if (!(control & PCI_MSIX_FLAGS_ENABLE))
		return 0;
	save_state = kzalloc(sizeof(struct pci_cap_saved_state) + sizeof(u16),
		GFP_KERNEL);
	if (!save_state) {
		printk(KERN_ERR "Out of memory in pci_save_msix_state\n");
		return -ENOMEM;
	}
	*((u16 *)&save_state->data[0]) = control;

	/* save the table */
	irq = head = dev->first_msi_irq;
	while (head != tail) {
		struct msi_desc *entry;

		entry = msi_desc[irq];
		read_msi_msg(irq, &entry->msg_save);

		tail = msi_desc[irq]->link.tail;
		irq = tail;
	}

	save_state->cap_nr = PCI_CAP_ID_MSIX;
	pci_add_saved_cap(dev, save_state);
	return 0;
}
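
The "save the table" loop above walks a ring of descriptors linked
through link.tail and stops when it gets back to the first irq.  A
minimal user-space sketch of that walk, assuming made-up names
(model_desc, model_save_table) and plain integers in place of the real
MSI messages; it is not the kernel code, only a model of the loop:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_IRQ 16

/* Hypothetical stand-in for the kernel's struct msi_desc. */
struct model_desc {
	int next;	/* plays the role of link.tail: next irq in the ring */
	unsigned saved;	/* plays the role of msg_save */
	unsigned live;	/* plays the role of the message held by the device */
};

/* Save every entry reachable from first_irq; returns how many were saved. */
static int model_save_table(struct model_desc *desc, int first_irq)
{
	int irq, head, tail = 0;
	int count = 0;

	/* tail starts at 0, so a list starting at irq 0 would never be walked */
	assert(desc != NULL && first_irq > 0 && first_irq < MAX_IRQ);

	irq = head = first_irq;
	while (head != tail) {
		desc[irq].saved = desc[irq].live;
		count++;

		tail = desc[irq].next;
		irq = tail;
	}
	return count;
}
```

A device with a single entry is simply a ring of one (next points back
at itself), so the loop body runs exactly once.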

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29 17:25                 ` Eric W. Biederman
@ 2007-01-29 17:32                   ` Roland Dreier
  0 siblings, 0 replies; 178+ messages in thread
From: Roland Dreier @ 2007-01-29 17:32 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

 > There is this function.  It does save the table, I haven't
 > looked closely enough yet to know if it saves all of the
 > other details yet.

OK, fair enough, I missed that call to pci_save_msix_state() in
pci_save_state().  Yeah, I guess it's close to working.

 - R.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  9:03                       ` Eric W. Biederman
@ 2007-01-29 20:22                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 20:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci


> This is the most straight forward and handles machines with really
> weird msi setups, so I lean in this direction.
> 
> The question is there anything at all we can do generically?
> 
> I can't see a case where ppc_md would not wind up with the hooks
> that decide if it is a hypervisor or not.  Even if we came up
> with a better set of functions you need to hook.

Sure, but with Michael's approach, the only hook was get_msi_ops(pdev).

Anyway, there isn't -that- much that can be done generically in the HV
case. Mostly some argument sanity checking, the logic for saving &
restoring pdev->irq for MSIs, that sort of thing.

> Ok. I think I get the point of check.  I believe I need to look at your
> code a little more and see what you are doing to see if there is anything
> generic worth doing, that we can always do outside of architecture code
> no matter how much of the job the Hypervisor wants to do for us.

I understand.

> I'd hate to hit a different Hypervisor that did something close but
> not quite the same and have the code fail then.  So definitely
> avoiding touching pci config space at all in the calls seems to make a
> lot of sense.  This includes avoiding pci_find_capability right?

Quite possibly yes. I'm pretty sure it will work on IBM HV but we aren't
really supposed to use it...

> Off the top of my head the only things we can do generically are
> some data structure things and flags like dev->msi_enabled or
> dev->msix_enabled.

That and the saving & restoring of pdev->irq. That is not very much.

> Anyway have a nice night and more in the morning.

Ben.


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 10:11                         ` Michael Ellerman
@ 2007-01-29 20:32                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 20:32 UTC (permalink / raw)
  To: michael
  Cc: Eric W. Biederman, David Miller, jeff, greg, tony.luck, grundler,
	mingo, linux-kernel, kyle, linuxppc-dev, brice, shaohua.li,
	linux-pci


> You can read config space, but it's not clear to me if the HV is allowed
> to filter it and hide things. 

I've seen it do that, for example with EADS bridges. I haven't seen it
doing that with devices (other than hiding entire functions) but I
wouldn't exclude it...

> It's also possible that the device
> supports MSI, but for some reason the HV doesn't allow it on that device
> etc. so you really have to ask the HV if it's enabled. So pci_find_cap()
> shouldn't crash or anything, but it may lie to you.

Yup.

> One thing I did like about my code, is that pci_enable_msi() and
> pci_enable_msix() are just small wrappers around generic_enable_msi() -
> which does all the work, and is the same regardless of whether it's an
> MSI or MSI-X. Although that's facilitated by the type arg which you
> don't like.

Part of the reason is you make MSI look like MSI-X (a vector of 1 entry)
while Eric does the opposite.
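
To make the "MSI as a vector of 1 entry" shape concrete, here is a
minimal user-space sketch.  All names (model_dev, model_entry,
model_generic_enable and friends) are hypothetical and the vector
allocator is a toy counter; it only illustrates two thin wrappers over
one common path that takes a type argument, not the code from either
patch series:

```c
#include <assert.h>
#include <stddef.h>

enum model_msi_type { MODEL_MSI, MODEL_MSIX };

struct model_entry {
	int entry;	/* index into the device's vector table */
	int vector;	/* irq number, filled in on success */
};

struct model_dev {
	int msi_enabled;
	int msix_enabled;
	int next_free_vector;	/* toy allocator state */
};

/* The common path: identical for MSI and MSI-X apart from the type. */
static int model_generic_enable(struct model_dev *dev,
				struct model_entry *entries, int nvec,
				enum model_msi_type type)
{
	int i;

	assert(dev != NULL && entries != NULL);

	if (nvec < 1 || dev->msi_enabled || dev->msix_enabled)
		return -1;
	if (type == MODEL_MSI && nvec != 1)
		return -1;

	for (i = 0; i < nvec; i++)
		entries[i].vector = dev->next_free_vector++;

	if (type == MODEL_MSI)
		dev->msi_enabled = 1;
	else
		dev->msix_enabled = 1;
	return 0;
}

/* MSI is presented to the common path as a one-entry vector. */
static int model_enable_msi(struct model_dev *dev, int *irq)
{
	struct model_entry one = { 0, -1 };
	int rc = model_generic_enable(dev, &one, 1, MODEL_MSI);

	if (rc == 0)
		*irq = one.vector;
	return rc;
}

static int model_enable_msix(struct model_dev *dev,
			     struct model_entry *entries, int nvec)
{
	return model_generic_enable(dev, entries, nvec, MODEL_MSIX);
}
```

Going the other way (MSI-X built out of repeated single-vector
allocations) would drop the type argument but lose the up-front count.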

Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29 16:57             ` Roland Dreier
  2007-01-29 17:02               ` Roland Dreier
@ 2007-01-29 22:03               ` Grant Grundler
  1 sibling, 0 replies; 178+ messages in thread
From: Grant Grundler @ 2007-01-29 22:03 UTC (permalink / raw)
  To: Roland Dreier
  Cc: Grant Grundler, Greg Kroah-Hartman, Kyle McMartin, linuxppc-dev,
	Eric W. Biederman, shaohua.li, linux-pci, David S. Miller,
	Brice Goglin

On Mon, Jan 29, 2007 at 08:57:53AM -0800, Roland Dreier wrote:
>  > laptops now come with 1000BT chips that _do_ support MSI-X.
> 
> Really?  Which gigE chips are using MSI-X (as opposed to MSI)?

Sorry, I was thinking MSI (not MSI-X).
The point was that MSI will need suspend/resume support.

> (I am using MSI with e1000 on my laptop, but I've not seen any NICs
> other than 10-gigE NICs that even have an MSI-X capability -- none of
> the e1000, tg3 or bnx2 devices I have around have it at least)

yes - same here.  gige NICs don't need MSI-X if they can use MSI.

thanks,
grant
 
>  - R.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 20:22                         ` Benjamin Herrenschmidt
@ 2007-01-29 23:05                           ` Paul Mackerras
  -1 siblings, 0 replies; 178+ messages in thread
From: Paul Mackerras @ 2007-01-29 23:05 UTC (permalink / raw)
  To: Eric W. Biederman, Benjamin Herrenschmidt
  Cc: tony.luck, grundler, jeff, greg, linux-kernel, kyle,
	linuxppc-dev, linux-pci, brice, shaohua.li, mingo, David Miller

Benjamin Herrenschmidt writes:

> > I'd hate to hit a different Hypervisor that did something close but
> > not quite the same and have the code fail then.  So definitely
> > avoiding touching pci config space at all in the calls seems to make a
> > lot of sense.  This includes avoiding pci_find_capability right?
> 
> Quite possibly yes. I'm pretty sure it will work on IBM HV but we aren't
> really supposed to use it...

Actually, I don't know of any reason why we can't use
pci_find_capability.  We are supposed to avoid trying to touch config
space of devices (in fact, functions) that aren't assigned to our
partition, but we're not talking about that here.

I just got an answer from the hypervisor architects.  It turns out
that the hardware _does_ prevent the device from sending MSI messages
to another partition.  The OS _can_ write whatever it likes to the MSI
address and data registers.  It can potentially lose interrupts (or, I
expect, get the device isolated by EEH) but it can't disrupt another
partition.

I think the reason why the hypervisor call writes the values straight
into the MSI/MSI-X registers in the device is (a) that's convenient
for AIX, since it saves it from immediately having to do more calls
into the hypervisor to write those values to the device, and (b) there
are some ABI complications in returning a lot of values, so the device
registers provide a convenient place to return those values.

So it would be possible, although gross, to do the hypervisor call,
read the values from config space and return them to the generic code,
then let the generic code write them to config space for us. :P

The remaining point of difference then seems to be that for MSI-X, we
really want to know up-front how many interrupts the device driver is
asking for, rather than having a series of alloc requests dribble in
one at a time.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 10:11                         ` Michael Ellerman
@ 2007-01-29 23:29                           ` Paul Mackerras
  -1 siblings, 0 replies; 178+ messages in thread
From: Paul Mackerras @ 2007-01-29 23:29 UTC (permalink / raw)
  To: michael
  Cc: Eric W. Biederman, tony.luck, grundler, jeff, linux-kernel, kyle,
	linuxppc-dev, linux-pci, brice, greg, shaohua.li, mingo,
	David Miller

Michael Ellerman writes:

> You can read config space, but it's not clear to me if the HV is allowed
> to filter it and hide things. It's also possible that the device

It appears that the HV does not prevent us from reading or writing any
config space registers for functions that are assigned to us.

> supports MSI, but for some reason the HV doesn't allow it on that device
> etc. so you really have to ask the HV if it's enabled. So pci_find_cap()
> shouldn't crash or anything, but it may lie to you.

It's possible that the device can do MSI(X), but that using MSI(X)
requires other platform resources (e.g. interrupt source numbers) and
there are none free.  I believe the platform guarantees a minimum
number of MSI(X) interrupts per function, but a pci_enable_msix() call
may not be able to give the driver as many MSI-X interrupts as it is
requesting even if the function can handle that many.
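
From a driver's point of view, that suggests asking for the whole set
in one call and being prepared to settle for fewer.  A minimal
user-space sketch, assuming a hypothetical platform model
(model_platform, model_enable_msix) in which the enable call returns 0
on success, a positive count when only that many vectors are available,
and -1 when none are; the return convention is an assumption made for
the example:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical platform state: free interrupt sources and what was granted. */
struct model_platform {
	int free_sources;
	int granted;
};

/*
 * Returns 0 on success, a positive count if only that many vectors could
 * be granted, or -1 if no vectors are available at all.
 */
static int model_enable_msix(struct model_platform *p, int nvec)
{
	assert(p != NULL && nvec > 0);

	if (p->free_sources == 0)
		return -1;
	if (nvec > p->free_sources)
		return p->free_sources;

	p->free_sources -= nvec;
	p->granted += nvec;
	return 0;
}

/*
 * Ask for 'want' vectors up front and retry with whatever the platform
 * says it can give.  Returns the number obtained, or 0 if MSI-X cannot
 * be used and the caller should fall back to something else.
 */
static int model_request_vectors(struct model_platform *p, int want)
{
	int rc;

	while (want > 0) {
		rc = model_enable_msix(p, want);
		if (rc == 0)
			return want;
		if (rc < 0)
			return 0;
		want = rc;	/* platform told us how many it can spare */
	}
	return 0;
}
```

The point of passing the full count each time is that the platform sees
the whole request at once instead of a series of single allocations.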

Paul.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 23:29                           ` Paul Mackerras
@ 2007-01-29 23:40                             ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 178+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 23:40 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: michael, tony.luck, grundler, jeff, David Miller, greg,
	linux-kernel, kyle, linuxppc-dev, Eric W. Biederman, shaohua.li,
	linux-pci, mingo, brice


> It's possible that the device can do MSI(X), but that using MSI(X)
> requires other platform resources (e.g. interrupt source numbers) and
> there are none free.  I believe the platform guarantees a minimum
> number of MSI(X) interrupts per function, but a pci_enable_msix() call
> may not be able to give the driver as many MSI-X interrupts as it is
> requesting even if the function can handle that many.

However, the ibm,req#msi(-x) properties contain the number as requested
by the device, and thus I expect them to be identical to the config
space value. So if you are confident enough that our HV won't play any
tricks there in the future, reading the config space is as good as
hooking that check() callback, though it might not be vs. some other HV
for some other platform that might be more strict.

Unfortunately, we cannot know in advance the maximum the HV will give
us without actually trying ibm,change-msi and seeing its result code.

Ben.



^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 23:05                           ` Paul Mackerras
@ 2007-01-30 19:32                             ` Segher Boessenkool
  -1 siblings, 0 replies; 178+ messages in thread
From: Segher Boessenkool @ 2007-01-30 19:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Eric W. Biederman, Benjamin Herrenschmidt, tony.luck, grundler,
	jeff, David Miller, greg, linux-kernel, kyle, linuxppc-dev,
	brice, shaohua.li, linux-pci, mingo

> I just got an answer from the hypervisor architects.  It turns out
> that the hardware _does_ prevent the device from sending MSI messages
> to another partition.  The OS _can_ write whatever it likes to the MSI
> address and data registers.  It can potentially lose interrupts (or, I
> expect, get the device isolated by EEH) but it can't disrupt another
> partition.

The OS however has to write the values the HV wants to
the device, or things won't work -- so the HV can just
as well do it itself.  Also, pulling all the work into
the HV makes for a cleaner, more generic design (who
knows what hardware will show up within the next few
years, the HV interface had better be prepared).


Segher


^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-29  0:58                         ` Benjamin Herrenschmidt
  2007-01-29  1:13                           ` David Miller
@ 2007-01-31  6:52                           ` David Miller
  2007-01-31  7:40                             ` Eric W. Biederman
  1 sibling, 1 reply; 178+ messages in thread
From: David Miller @ 2007-01-31  6:52 UTC (permalink / raw)
  To: benh; +Cc: greg, kyle, linuxppc-dev, brice, shaohua.li, linux-pci, ebiederm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Mon, 29 Jan 2007 11:58:21 +1100

> Do you have some pointers to documentation on those sparc64
> interfaces ?

So I got things working on sparc64 with a one-liner to the current
upstream vanilla 2.6.20-rc7 :-)  It's not the best, but it works.

You can see it all at:

	kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6.git

Basically, I changed arch_teardown_msi_irq() to pass in the
PCI device pointer, that's it.

The rest is sparc64 specific stuff.

One thing that's disappointing is that this "MSI Queue" framework
sparc64 has really suggests a two-tiered interrupt handling scheme.
As I previously explained, on sparc64 you associate each MSI with a
queue, and you can attach multiple MSIs to a single queue.

The queue is what generates the interrupt, and in response to that
interrupt you process a ring of MSI descriptors in the queue.  The
descriptors have a bunch of very useful information which we have no
way to make use of currently, and in particular it has the MSI number
in each entry.

So what would be cool would be to be able to attach the IRQ
action entries to a list inside of the MSI queue.

Instead, what happens right now is that each queue has a single
MSI associated with it, and that's the interrupt.

The MSI descriptors have all sorts of useful information, such as a
system TICK timestamp (for profiling), the exact bus/dev/fn that
generated the MSI (for debugging), as well as the full MSI address,
data, and code values (for IRQ dispatch and PCI-E error message
processing).
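
The two-tiered scheme described above could look roughly like this.
This is a user-space sketch only: model_msiq, its ring layout and the
per-MSI action table are all made up for illustration, and the
descriptor carries nothing but the MSI number, not the real sparc64
queue entry format:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RING_SIZE	8
#define MODEL_NR_MSI	16

/* Hypothetical queue entry: only the MSI number is modelled here. */
struct model_msiq_entry {
	unsigned int msi;
};

struct model_msiq {
	struct model_msiq_entry ring[MODEL_RING_SIZE];
	unsigned int head;	/* next entry the OS will consume */
	unsigned int tail;	/* next entry the hardware will fill */
	void (*action[MODEL_NR_MSI])(unsigned int msi, void *arg);
	void *arg[MODEL_NR_MSI];
};

/* Hardware side of the model: append one descriptor, -1 if the ring is full. */
static int model_msiq_push(struct model_msiq *q, unsigned int msi)
{
	unsigned int next = (q->tail + 1) % MODEL_RING_SIZE;

	if (next == q->head)
		return -1;
	q->ring[q->tail].msi = msi;
	q->tail = next;
	return 0;
}

/*
 * First tier: the queue raised its interrupt.  Second tier: drain the
 * ring and run the action attached to each MSI number found in it.
 * Returns the number of descriptors that had an action to run.
 */
static int model_msiq_interrupt(struct model_msiq *q)
{
	int handled = 0;

	while (q->head != q->tail) {
		unsigned int msi = q->ring[q->head].msi;

		q->head = (q->head + 1) % MODEL_RING_SIZE;
		assert(msi < MODEL_NR_MSI);
		if (q->action[msi]) {
			q->action[msi](msi, q->arg[msi]);
			handled++;
		}
	}
	return handled;
}

/* Example action: count how often it was called. */
static void model_count_action(unsigned int msi, void *arg)
{
	(void)msi;
	(*(int *)arg)++;
}
```

With the actions hanging off the queue, several MSIs can share one
queue interrupt and still be dispatched individually.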

I've tested tg3 with MSI on my Niagara, it seems to work well.
Unfortunately I don't have any MSI-X capable devices here, but
eventually I am sure I will.  I glanced over the MSI-X code and I see
no reason why it wouldn't work.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-31  6:52                           ` David Miller
@ 2007-01-31  7:40                             ` Eric W. Biederman
  2007-02-01  0:55                               ` David Miller
  0 siblings, 1 reply; 178+ messages in thread
From: Eric W. Biederman @ 2007-01-31  7:40 UTC (permalink / raw)
  To: David Miller
  Cc: kyle, linuxppc-dev, ebiederm, greg, shaohua.li, linux-pci, brice

David Miller <davem@davemloft.net> writes:

> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Mon, 29 Jan 2007 11:58:21 +1100
>
>> Do you have some pointers to documentation on those sparc64
>> interfaces ?
>
> So I got things working on sparc64 with a one-liner to the current
> upstream vanilla 2.6.20-rc7 :-)  It's not the best, but it works.
>
> You can see it all at:
>
> 	kernel.org:/pub/scm/linux/kernel/git/davem/sparc-2.6.git
>
> Basically, I changed arch_teardown_msi_irq() to pass in the
> PCI device pointer, that's it.

Neat. 

I think you could have omitted your one liner if you had done:
struct msi_desc *entry = get_irq_data(irq);
struct pci_dev *dev = entry->dev;
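
Spelled out as a user-space sketch, the idea is that the per-irq data
already points at the descriptor, and the descriptor already points at
the device, so teardown needs nothing but the irq number.  Every name
here (model_irq_data, model_msi_desc, model_teardown_msi_irq) is a
hypothetical stand-in for the kernel objects mentioned above:

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_IRQS 32

struct model_pci_dev {
	int devfn;
};

/* Hypothetical stand-in for struct msi_desc: remembers its device. */
struct model_msi_desc {
	struct model_pci_dev *dev;
};

/* Per-irq private data, in the spirit of set_irq_data()/get_irq_data(). */
static void *model_irq_data[MODEL_NR_IRQS];

static void model_set_irq_data(unsigned int irq, void *data)
{
	assert(irq < MODEL_NR_IRQS);
	model_irq_data[irq] = data;
}

static void *model_get_irq_data(unsigned int irq)
{
	assert(irq < MODEL_NR_IRQS);
	return model_irq_data[irq];
}

/*
 * Teardown takes only the irq; the device is recovered from the
 * descriptor.  Returns the device the irq belonged to, or NULL if the
 * irq had no descriptor attached.
 */
static struct model_pci_dev *model_teardown_msi_irq(unsigned int irq)
{
	struct model_msi_desc *entry = model_get_irq_data(irq);
	struct model_pci_dev *dev = entry ? entry->dev : NULL;

	model_set_irq_data(irq, NULL);
	return dev;
}
```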

> The rest is sparc64 specific stuff.
>
> One thing that's disappointing is that this "MSI Queue" framework
> sparc64 has really suggests a two-tiered interrupt handling scheme.
> As I previously explained, on sparc64 you associate each MSI with a
> queue, and you can attach multiple MSIs to a single queue.

Interesting.

> I've tested tg3 with MSI on my Niagara, it seems to work well.
> Unfortunately I don't have any MSI-X capable devices here, but
> eventually I am sure I will.  I glanced over the MSI-X code and I see
> no reason why it wouldn't work.

Congratulations!

Eric

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 0/16] Ops based MSI Implementation
  2007-01-31  7:40                             ` Eric W. Biederman
@ 2007-02-01  0:55                               ` David Miller
  0 siblings, 0 replies; 178+ messages in thread
From: David Miller @ 2007-02-01  0:55 UTC (permalink / raw)
  To: ebiederm; +Cc: kyle, linuxppc-dev, brice, greg, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Wed, 31 Jan 2007 00:40:34 -0700

> I think you could have omitted your one liner if you had done:
> struct msi_desc *entry = get_irq_data(irq);
> struct pci_dev *dev = entry->dev;

Good idea, I reimplemented it that way.

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [RFC/PATCH 4/16] Abstract MSI suspend
  2007-01-29  7:22     ` Michael Ellerman
  2007-01-29  8:45       ` Eric W. Biederman
@ 2007-02-01  4:24       ` Greg KH
  1 sibling, 0 replies; 178+ messages in thread
From: Greg KH @ 2007-02-01  4:24 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Kyle McMartin, linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller, Eric W. Biederman

On Mon, Jan 29, 2007 at 06:22:57PM +1100, Michael Ellerman wrote:
> On Sun, 2007-01-28 at 01:27 -0700, Eric W. Biederman wrote:
> > Michael Ellerman <michael@ellerman.id.au> writes:
> > 
> > > Currently pci_disable_device() disables MSI on a device by twiddling
> > > bits in config space via disable_msi_mode().
> > >
> > > On some platforms that may not be appropriate, so abstract the MSI
> > > suspend logic into pci_disable_device_msi().
> > 
> > >
> > > Signed-off-by: Michael Ellerman <michael@ellerman.id.au>
> > > ---
> > >
> > >  drivers/pci/msi.c |   11 +++++++++++
> > >  drivers/pci/pci.c |    7 +------
> > >  drivers/pci/pci.h |    2 ++
> > >  3 files changed, 14 insertions(+), 6 deletions(-)
> > >
> > > Index: msi/drivers/pci/msi.c
> > > ===================================================================
> > > --- msi.orig/drivers/pci/msi.c
> > > +++ msi/drivers/pci/msi.c
> > > @@ -271,6 +271,17 @@ void disable_msi_mode(struct pci_dev *de
> > >  	pci_intx(dev, 1);  /* enable intx */
> > >  }
> > >  
> > > +void pci_disable_device_msi(struct pci_dev *dev)
> > > +{
> > > +	if (dev->msi_enabled)
> > > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> > > +			PCI_CAP_ID_MSI);
> > > +
> > > +	if (dev->msix_enabled)
> > > +		disable_msi_mode(dev, pci_find_capability(dev, PCI_CAP_ID_MSI),
> > > +			PCI_CAP_ID_MSIX);
> > 
> > Just a quick note. This is wrong.  It should be PCI_CAP_ID_MSIX.
> > The code that is being moved is buggy.  So the patch itself doesn't
> > make the situation any worse.
> 
> Greg, if you want to drop that patch I'll prepare two patches to fix it
> and then move it. I don't have any hardware to test, although I'm
> guessing no one does given that it's been broken since its inception.

Ok, I've now dropped it.

thanks,

greg k-h
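
The bug Eric points out above (looking up the MSI capability offset and
then disabling MSI-X with it) can be illustrated with a small userspace
sketch.  Everything here is a simplified stand-in: find_capability(),
disable_mode() and the struct layout are hypothetical, and only the two
capability ID values are taken from the PCI specification.

```c
#include <assert.h>

#define PCI_CAP_ID_MSI  0x05
#define PCI_CAP_ID_MSIX 0x11

struct pci_dev {
	int msi_pos, msix_pos;          /* config-space offsets, 0 if absent */
	unsigned int msi_enabled:1;
	unsigned int msix_enabled:1;
};

/* Stand-in for pci_find_capability(). */
static int find_capability(const struct pci_dev *dev, int cap)
{
	if (cap == PCI_CAP_ID_MSI)
		return dev->msi_pos;
	if (cap == PCI_CAP_ID_MSIX)
		return dev->msix_pos;
	return 0;
}

/* Records the last (pos, type) pair so the effect can be inspected. */
static int last_pos, last_type;

static void disable_mode(struct pci_dev *dev, int pos, int type)
{
	assert(pos > 0);                /* zero means "capability not found" */
	last_pos = pos;
	last_type = type;
	if (type == PCI_CAP_ID_MSI)
		dev->msi_enabled = 0;
	else
		dev->msix_enabled = 0;
}

/* Each mode is disabled through the offset of its own capability. */
static void disable_device_msi(struct pci_dev *dev)
{
	if (dev->msi_enabled)
		disable_mode(dev, find_capability(dev, PCI_CAP_ID_MSI),
			     PCI_CAP_ID_MSI);

	if (dev->msix_enabled)
		disable_mode(dev, find_capability(dev, PCI_CAP_ID_MSIX),
			     PCI_CAP_ID_MSIX);
}
```

With the mismatched lookup, an MSI-X device would be handed the MSI
capability's offset, or 0 if it has no MSI capability at all.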

^ permalink raw reply	[flat|nested] 178+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 22:09           ` Benjamin Herrenschmidt
@ 2007-02-01  4:29             ` Greg KH
  -1 siblings, 0 replies; 178+ messages in thread
From: Greg KH @ 2007-02-01  4:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Eric W. Biederman, Jeff Garzik, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

On Mon, Jan 29, 2007 at 09:09:14AM +1100, Benjamin Herrenschmidt wrote:
> 
> If we followed that "only do incremental changes" rule all the time,
> imagine what state our USB stack would be in today, since we couldn't
> have dropped in Linus' replacement one ...

Bad example, that is not what happened at all.  There was not an
in-kernel USB stack when Linus wrote his.  Inaky had his
all-singing-all-dancing stack outside of the tree, and no one was really
helping out with it.

Only when Linus added his code to mainline did we all jump on it and
_incrementally_ improve it to what we have today.

So, in a way, you just proved that we need to do this in an incremental
fashion, which is what I was also saying all along :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 178+ messages in thread

* patch msi-fix-msi_remove_pci_irq_vectors.patch added to gregkh-2.6 tree
  2007-01-28 19:45         ` Eric W. Biederman
  (?)
  (?)
@ 2007-02-01  6:07         ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:07 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Fix msi_remove_pci_irq_vectors.

to my gregkh-2.6 tree.  Its filename is

     msi-fix-msi_remove_pci_irq_vectors.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Wed Jan 31 22:00:21 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:45:54 -0700
Subject: msi: Fix msi_remove_pci_irq_vectors.
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m14pqbq6j1.fsf_-_@ebiederm.dsl.xmission.com>

Since msi_remove_pci_irq_vectors is designed to be called during
hotplug remove it is actively wrong to query the hardware and expect
meaningful results back.

To that end remove the pci_find_capability calls.  Testing
dev->msi_enabled and dev->msix_enabled gives us all of the information
we need.
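
A minimal userspace sketch of why the hardware query is dropped.  It
assumes that a capability lookup on a removed device finds nothing;
the struct, the `present` flag and the helper names are hypothetical
stand-ins, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

struct pci_dev {
	bool present;                 /* false once the card has been pulled */
	int msi_cap_pos;              /* where the MSI capability used to live */
	unsigned int msi_enabled:1;   /* software state set by the enable path */
};

/* Stand-in for pci_find_capability(): assumed to find nothing once the
 * device is gone. */
static int find_msi_capability(const struct pci_dev *dev)
{
	assert(dev);
	return dev->present ? dev->msi_cap_pos : 0;
}

/* Old test: depends on what the (possibly absent) hardware reports. */
static bool needs_msi_cleanup_old(const struct pci_dev *dev)
{
	return find_msi_capability(dev) > 0 && dev->msi_enabled;
}

/* New test: depends only on state the kernel recorded itself. */
static bool needs_msi_cleanup_new(const struct pci_dev *dev)
{
	return dev->msi_enabled;
}
```

Under that assumption the old test skips the cleanup for a device that
has already been removed, while the new test still releases its irq.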

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c |    8 ++------
 1 file changed, 2 insertions(+), 6 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -854,13 +854,10 @@ void pci_disable_msix(struct pci_dev* de
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos;
-
 	if (!pci_msi_enable || !dev)
  		return;
 
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && dev->msi_enabled) {
+	if (dev->msi_enabled) {
 		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
@@ -869,8 +866,7 @@ void msi_remove_pci_irq_vectors(struct p
 		} else /* Release MSI irq assigned to this device */
 			msi_free_irq(dev, dev->first_msi_irq);
 	}
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && dev->msix_enabled) {
+	if (dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 


Patches currently in gregkh-2.6 which might be from greg@kroah.com are

^ permalink raw reply	[flat|nested] 178+ messages in thread

* patch msi-kill-msi_lookup_irq.patch added to gregkh-2.6 tree
  2007-01-28 19:42     ` Eric W. Biederman
                       ` (2 preceding siblings ...)
  (?)
@ 2007-02-01  6:07     ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:07 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Kill msi_lookup_irq

to my gregkh-2.6 tree.  Its filename is

     msi-kill-msi_lookup_irq.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From ebiederm@xmission.com  Wed Jan 31 21:59:15 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:42:52 -0700
Subject: msi: Kill msi_lookup_irq
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m1d54zq6o3.fsf_-_@ebiederm.dsl.xmission.com>


The function msi_lookup_irq was horrible.  As a side effect of running,
it changed dev->irq, and then the callers would need to change it
back.  In addition it does a global scan through all of the irqs,
which seems to be the sole justification of the msi_lock.

To remove the need for msi_lookup_irq I added first_msi_irq to struct
pci_dev.  Then depending on the context I replaced msi_lookup_irq with
dev->first_msi_irq, dev->msi_enabled, or dev->msix_enabled.

msi_enabled and msix_enabled were already present in pci_dev for other
reasons.
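
The difference between the two approaches can be sketched in userspace C.
This is a simplified model under stated assumptions: the structs are
cut-down stand-ins, lookup_by_scan() omits the default_irq comparison
that the real msi_lookup_irq performs, and lookup_cached() is a
hypothetical helper that does not exist in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS  16
#define CAP_MSI  0x05   /* value of PCI_CAP_ID_MSI */
#define CAP_MSIX 0x11   /* value of PCI_CAP_ID_MSIX */

struct pci_dev {
	unsigned int irq;            /* pin-based (INTx) irq */
	unsigned int first_msi_irq;  /* cached by the enable path */
	unsigned int msi_enabled:1;
	unsigned int msix_enabled:1;
};

struct msi_desc {
	struct pci_dev *dev;
	int type;
};

static struct msi_desc *msi_desc[NR_IRQS];

/* Old style: scans every irq and clobbers dev->irq as a side effect. */
static int lookup_by_scan(struct pci_dev *dev, int type)
{
	for (unsigned int irq = 0; irq < NR_IRQS; irq++) {
		if (msi_desc[irq] && msi_desc[irq]->dev == dev &&
		    msi_desc[irq]->type == type) {
			dev->irq = irq;
			return 0;
		}
	}
	return -1;
}

/* New style: no scan and no side effect, dev->irq is never touched. */
static int lookup_cached(const struct pci_dev *dev, int type)
{
	if (type == CAP_MSI && dev->msi_enabled)
		return (int)dev->first_msi_irq;
	if (type == CAP_MSIX && dev->msix_enabled)
		return (int)dev->first_msi_irq;
	return -1;
}
```

Since the cached path never writes dev->irq, callers no longer need the
save-and-restore dance around every lookup.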

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c   |  149 ++++++++++++++++++++--------------------------------
 include/linux/pci.h |    3 +
 2 files changed, 62 insertions(+), 90 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -272,28 +272,6 @@ void disable_msi_mode(struct pci_dev *de
 	pci_intx(dev, 1);  /* enable intx */
 }
 
-static int msi_lookup_irq(struct pci_dev *dev, int type)
-{
-	int irq;
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
-	for (irq = 0; irq < NR_IRQS; irq++) {
-		if (!msi_desc[irq] || msi_desc[irq]->dev != dev ||
-			msi_desc[irq]->msi_attrib.type != type ||
-			msi_desc[irq]->msi_attrib.default_irq != dev->irq)
-			continue;
-		spin_unlock_irqrestore(&msi_lock, flags);
-		/* This pre-assigned MSI irq for this device
-		   already exists. Override dev->irq with this irq */
-		dev->irq = irq;
-		return 0;
-	}
-	spin_unlock_irqrestore(&msi_lock, flags);
-
-	return -EACCES;
-}
-
 #ifdef CONFIG_PM
 static int __pci_save_msi_state(struct pci_dev *dev)
 {
@@ -364,11 +342,13 @@ static void __pci_restore_msi_state(stru
 static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
-	int temp;
 	int irq, head, tail = 0;
 	u16 control;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return 0;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (pos <= 0 || dev->no_msi)
 		return 0;
@@ -386,13 +366,7 @@ static int __pci_save_msix_state(struct 
 	*((u16 *)&save_state->data[0]) = control;
 
 	/* save the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		kfree(save_state);
-		return -EINVAL;
-	}
-
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		struct msi_desc *entry;
 
@@ -402,7 +376,6 @@ static int __pci_save_msix_state(struct 
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	save_state->cap_nr = PCI_CAP_ID_MSIX;
 	pci_add_saved_cap(dev, save_state);
@@ -428,9 +401,11 @@ static void __pci_restore_msix_state(str
 	int pos;
 	int irq, head, tail = 0;
 	struct msi_desc *entry;
-	int temp;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return;
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_MSIX);
 	if (!save_state)
 		return;
@@ -443,10 +418,7 @@ static void __pci_restore_msix_state(str
 		return;
 
 	/* route the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX))
-		return;
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		entry = msi_desc[irq];
 		write_msi_msg(irq, &entry->msg_save);
@@ -454,7 +426,6 @@ static void __pci_restore_msix_state(str
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
@@ -524,6 +495,7 @@ static int msi_capability_init(struct pc
 		return status;
 	}
 
+	dev->first_msi_irq = irq;
 	attach_msi_entry(entry, irq);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
@@ -620,6 +592,7 @@ static int msix_capability_init(struct p
 			avail = -EBUSY;
 		return avail;
 	}
+	dev->first_msi_irq = entries[0].vector;
 	/* Set MSI-X enabled bits */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
@@ -667,13 +640,11 @@ int pci_msi_supported(struct pci_dev * d
  **/
 int pci_enable_msi(struct pci_dev* dev)
 {
-	int pos, temp, status;
+	int pos, status;
 
 	if (pci_msi_supported(dev) < 0)
 		return -EINVAL;
 
-	temp = dev->irq;
-
 	status = msi_init();
 	if (status < 0)
 		return status;
@@ -682,15 +653,14 @@ int pci_enable_msi(struct pci_dev* dev)
 	if (!pos)
 		return -EINVAL;
 
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
+	WARN_ON(!!dev->msi_enabled);
 
 	/* Check whether driver already requested for MSI-X irqs */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 			printk(KERN_INFO "PCI: %s: Can't enable MSI.  "
-			       "Device already has MSI-X irq assigned\n",
+			       "Device already has MSI-X enabled\n",
 			       pci_name(dev));
-			dev->irq = temp;
 			return -EINVAL;
 	}
 	status = msi_capability_init(dev);
@@ -709,6 +679,9 @@ void pci_disable_msi(struct pci_dev* dev
 	if (!dev)
 		return;
 
+	if (!dev->msi_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (!pos)
 		return;
@@ -717,28 +690,30 @@ void pci_disable_msi(struct pci_dev* dev
 	if (!(control & PCI_MSI_FLAGS_ENABLE))
 		return;
 
+
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
 	spin_lock_irqsave(&msi_lock, flags);
-	entry = msi_desc[dev->irq];
+	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
-	if (irq_has_action(dev->irq)) {
+	if (irq_has_action(dev->first_msi_irq)) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
-		       pci_name(dev), dev->irq);
-		BUG_ON(irq_has_action(dev->irq));
+		       pci_name(dev), dev->first_msi_irq);
+		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
 		spin_unlock_irqrestore(&msi_lock, flags);
-		msi_free_irq(dev, dev->irq);
+		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
 		dev->irq = default_irq;
 	}
+	dev->first_msi_irq = 0;
 }
 
 static int msi_free_irq(struct pci_dev* dev, int irq)
@@ -797,7 +772,7 @@ static int msi_free_irq(struct pci_dev* 
 int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 {
 	int status, pos, nr_entries;
-	int i, j, temp;
+	int i, j;
 	u16 control;
 
 	if (!entries || pci_msi_supported(dev) < 0)
@@ -825,16 +800,14 @@ int pci_enable_msix(struct pci_dev* dev,
 				return -EINVAL;	/* duplicate entry */
 		}
 	}
-	temp = dev->irq;
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
+	WARN_ON(!!dev->msix_enabled);
 
 	/* Check whether driver already requested for MSI irq */
    	if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
-		!msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
+		dev->msi_enabled) {
 		printk(KERN_INFO "PCI: %s: Can't enable MSI-X.  "
 		       "Device already has an MSI irq assigned\n",
 		       pci_name(dev));
-		dev->irq = temp;
 		return -EINVAL;
 	}
 	status = msix_capability_init(dev, entries, nvec);
@@ -843,7 +816,9 @@ int pci_enable_msix(struct pci_dev* dev,
 
 void pci_disable_msix(struct pci_dev* dev)
 {
-	int pos, temp;
+	int irq, head, tail = 0, warning = 0;
+	unsigned long flags;
+	int pos;
 	u16 control;
 
 	if (!pci_msi_enable)
@@ -851,6 +826,9 @@ void pci_disable_msix(struct pci_dev* de
 	if (!dev)
 		return;
 
+	if (!dev->msix_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (!pos)
 		return;
@@ -861,31 +839,25 @@ void pci_disable_msix(struct pci_dev* de
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
-	temp = dev->irq;
-	if (!msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		int irq, head, tail = 0, warning = 0;
-		unsigned long flags;
-
-		irq = head = dev->irq;
-		dev->irq = temp;			/* Restore pin IRQ */
-		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
-			tail = msi_desc[irq]->link.tail;
-			spin_unlock_irqrestore(&msi_lock, flags);
-			if (irq_has_action(irq))
-				warning = 1;
-			else if (irq != head)	/* Release MSI-X irq */
-				msi_free_irq(dev, irq);
-			irq = tail;
-		}
-		msi_free_irq(dev, irq);
-		if (warning) {
-			printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
-			       "free_irq() on all MSI-X irqs\n",
-			       pci_name(dev));
-			BUG_ON(warning > 0);
-		}
+	irq = head = dev->first_msi_irq;
+	while (head != tail) {
+		spin_lock_irqsave(&msi_lock, flags);
+		tail = msi_desc[irq]->link.tail;
+		spin_unlock_irqrestore(&msi_lock, flags);
+		if (irq_has_action(irq))
+			warning = 1;
+		else if (irq != head)	/* Release MSI-X irq */
+			msi_free_irq(dev, irq);
+		irq = tail;
+	}
+	msi_free_irq(dev, irq);
+	if (warning) {
+		printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
+			"free_irq() on all MSI-X irqs\n",
+			pci_name(dev));
+		BUG_ON(warning > 0);
 	}
+	dev->first_msi_irq = 0;
 }
 
 /**
@@ -899,30 +871,28 @@ void pci_disable_msix(struct pci_dev* de
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos, temp;
+	int pos;
 	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
 
-	temp = dev->irq;		/* Save IOAPIC IRQ */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
-		if (irq_has_action(dev->irq)) {
+	if (pos > 0 && dev->msi_enabled) {
+		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
-			       pci_name(dev), dev->irq);
-			BUG_ON(irq_has_action(dev->irq));
+			       pci_name(dev), dev->first_msi_irq);
+			BUG_ON(irq_has_action(dev->first_msi_irq));
 		} else /* Release MSI irq assigned to this device */
-			msi_free_irq(dev, dev->irq);
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
+			msi_free_irq(dev, dev->first_msi_irq);
 	}
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-		irq = head = dev->irq;
+		irq = head = dev->first_msi_irq;
 		while (head != tail) {
 			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
@@ -942,7 +912,6 @@ void msi_remove_pci_irq_vectors(struct p
 			       pci_name(dev));
 			BUG_ON(warning > 0);
 		}
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
 	}
 }
 
--- gregkh-2.6.orig/include/linux/pci.h
+++ gregkh-2.6/include/linux/pci.h
@@ -174,6 +174,9 @@ struct pci_dev {
 	struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
 	int rom_attr_enabled;		/* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+#ifdef CONFIG_PCI_MSI
+	unsigned int first_msi_irq;
+#endif
 };
 
 #define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)


Patches currently in gregkh-2.6 which might be from greg@kroah.com are

^ permalink raw reply	[flat|nested] 178+ messages in thread

* patch msi-kill-the-msi_desc-array.patch added to gregkh-2.6 tree
  2007-01-28 19:52             ` Eric W. Biederman
  (?)
  (?)
@ 2007-02-01  6:07             ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:07 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Kill the msi_desc array.

to my gregkh-2.6 tree.  Its filename is

     msi-kill-the-msi_desc-array.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Wed Jan 31 22:01:24 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:52:03 -0700
Subject: msi: Kill the msi_desc array.
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m1veiroroc.fsf_-_@ebiederm.dsl.xmission.com>


We need to be able to get from an irq number to a struct msi_desc.
The msi_desc array in msi.c had several shortcomings; the big one was
that it could not be used outside of msi.c.  Using irq_data in struct
irq_desc almost worked, except that on some architectures irq_data
needs to be used for something else.

So this patch adds a msi_desc pointer to irq_desc, adds the appropriate
wrappers and changes all of the msi code to use them.

The dynamic_irq_init/cleanup code was tweaked to ensure the new
field is left in a well defined state.
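
A userspace sketch of the layout described above, assuming a much
reduced irq_desc.  The field and accessor names follow the patch, but
the structs are stand-ins, the setters return -1 where the kernel
returns -EINVAL, and the desc->lock locking is left out.

```c
#include <assert.h>
#include <stddef.h>

#define NR_IRQS 16

struct msi_desc { int entry_nr; };   /* stand-in, not the kernel struct */

struct irq_desc {
	struct msi_desc *msi_desc;   /* dedicated MSI slot */
	void *handler_data;          /* stays free for the architecture */
};

static struct irq_desc irq_desc[NR_IRQS];

#define get_irq_msi(irq)  (irq_desc[irq].msi_desc)
#define get_irq_data(irq) (irq_desc[irq].handler_data)

static int set_irq_msi(unsigned int irq, struct msi_desc *entry)
{
	if (irq >= NR_IRQS)
		return -1;
	irq_desc[irq].msi_desc = entry;
	return 0;
}

static int set_irq_data(unsigned int irq, void *data)
{
	if (irq >= NR_IRQS)
		return -1;
	irq_desc[irq].handler_data = data;
	return 0;
}
```

With separate slots, an architecture can keep its own pointer in
handler_data without colliding with the MSI descriptor.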

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 arch/ia64/sn/kernel/msi_sn.c |    2 -
 drivers/pci/msi.c            |   44 ++++++++++++++++++++-----------------------
 include/linux/irq.h          |    4 +++
 kernel/irq/chip.c            |   28 +++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 24 deletions(-)

--- gregkh-2.6.orig/arch/ia64/sn/kernel/msi_sn.c
+++ gregkh-2.6/arch/ia64/sn/kernel/msi_sn.c
@@ -74,7 +74,7 @@ int sn_setup_msi_irq(unsigned int irq, s
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
 static int pci_msi_enable = 1;
@@ -43,7 +42,7 @@ static void msi_set_mask_bit(unsigned in
 {
 	struct msi_desc *entry;
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
@@ -73,7 +72,7 @@ static void msi_set_mask_bit(unsigned in
 
 void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch(entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -112,7 +111,7 @@ void read_msi_msg(unsigned int irq, stru
 
 void write_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -208,7 +207,7 @@ static int create_msi_irq(void)
 		return -EBUSY;
 	}
 
-	set_irq_data(irq, entry);
+	set_irq_msi(irq, entry);
 
 	return irq;
 }
@@ -217,9 +216,9 @@ static void destroy_msi_irq(unsigned int
 {
 	struct msi_desc *entry;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	set_irq_chip(irq, NULL);
-	set_irq_data(irq, NULL);
+	set_irq_msi(irq, NULL);
 	destroy_irq(irq);
 	kmem_cache_free(msi_cachep, entry);
 }
@@ -360,10 +359,10 @@ static int __pci_save_msix_state(struct 
 	while (head != tail) {
 		struct msi_desc *entry;
 
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		read_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -410,10 +409,10 @@ static void __pci_restore_msix_state(str
 	/* route the table */
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		write_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -451,7 +450,7 @@ static int msi_capability_init(struct pc
 	if (irq < 0)
 		return irq;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	entry->link.head = irq;
 	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
@@ -486,7 +485,7 @@ static int msi_capability_init(struct pc
 	}
 
 	dev->first_msi_irq = irq;
-	msi_desc[irq] = entry;
+	set_irq_msi(irq, entry);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -535,7 +534,7 @@ static int msix_capability_init(struct p
 		if (irq < 0)
 			break;
 
-		entry = get_irq_data(irq);
+		entry = get_irq_msi(irq);
  		j = entries[i].entry;
  		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
@@ -565,7 +564,7 @@ static int msix_capability_init(struct p
 			break;
 		}
 
-		msi_desc[irq] = entry;
+		set_irq_msi(irq, entry);
 	}
 	if (i != nvec) {
 		int avail = i - 1;
@@ -682,7 +681,7 @@ void pci_disable_msi(struct pci_dev* dev
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	entry = msi_desc[dev->first_msi_irq];
+	entry = get_irq_msi(dev->first_msi_irq);
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		return;
 	}
@@ -709,7 +708,7 @@ static int msi_free_irq(struct pci_dev* 
 
 	arch_teardown_msi_irq(irq);
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
 	}
@@ -717,10 +716,9 @@ static int msi_free_irq(struct pci_dev* 
 	entry_nr = entry->msi_attrib.entry_nr;
 	head = entry->link.head;
 	base = entry->mask_base;
-	msi_desc[entry->link.head]->link.tail = entry->link.tail;
-	msi_desc[entry->link.tail]->link.head = entry->link.head;
+	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
+	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
 	entry->dev = NULL;
-	msi_desc[irq] = NULL;
 
 	destroy_msi_irq(irq);
 
@@ -821,7 +819,7 @@ void pci_disable_msix(struct pci_dev* de
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		tail = msi_desc[irq]->link.tail;
+		tail = get_irq_msi(irq)->link.tail;
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -867,8 +865,8 @@ void msi_remove_pci_irq_vectors(struct p
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			tail = msi_desc[irq]->link.tail;
-			base = msi_desc[irq]->mask_base;
+			tail = get_irq_msi(irq)->link.tail;
+			base = get_irq_msi(irq)->mask_base;
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
--- gregkh-2.6.orig/include/linux/irq.h
+++ gregkh-2.6/include/linux/irq.h
@@ -68,6 +68,7 @@ typedef	void fastcall (*irq_flow_handler
 #define IRQ_MOVE_PENDING	0x40000000	/* need to re-target IRQ destination */
 
 struct proc_dir_entry;
+struct msi_desc;
 
 /**
  * struct irq_chip - hardware interrupt chip descriptor
@@ -148,6 +149,7 @@ struct irq_chip {
 struct irq_desc {
 	irq_flow_handler_t	handle_irq;
 	struct irq_chip		*chip;
+	struct msi_desc		*msi_desc;
 	void			*handler_data;
 	void			*chip_data;
 	struct irqaction	*action;	/* IRQ action list */
@@ -373,10 +375,12 @@ extern int set_irq_chip(unsigned int irq
 extern int set_irq_data(unsigned int irq, void *data);
 extern int set_irq_chip_data(unsigned int irq, void *data);
 extern int set_irq_type(unsigned int irq, unsigned int type);
+extern int set_irq_msi(unsigned int irq, struct msi_desc *entry);
 
 #define get_irq_chip(irq)	(irq_desc[irq].chip)
 #define get_irq_chip_data(irq)	(irq_desc[irq].chip_data)
 #define get_irq_data(irq)	(irq_desc[irq].handler_data)
+#define get_irq_msi(irq)	(irq_desc[irq].msi_desc)
 
 #endif /* CONFIG_GENERIC_HARDIRQS */
 
--- gregkh-2.6.orig/kernel/irq/chip.c
+++ gregkh-2.6/kernel/irq/chip.c
@@ -39,6 +39,7 @@ void dynamic_irq_init(unsigned int irq)
 	desc->chip = &no_irq_chip;
 	desc->handle_irq = handle_bad_irq;
 	desc->depth = 1;
+	desc->msi_desc = NULL;
 	desc->handler_data = NULL;
 	desc->chip_data = NULL;
 	desc->action = NULL;
@@ -74,6 +75,9 @@ void dynamic_irq_cleanup(unsigned int ir
 		WARN_ON(1);
 		return;
 	}
+	desc->msi_desc = NULL;
+	desc->handler_data = NULL;
+	desc->chip_data = NULL;
 	desc->handle_irq = handle_bad_irq;
 	desc->chip = &no_irq_chip;
 	spin_unlock_irqrestore(&desc->lock, flags);
@@ -162,6 +166,30 @@ int set_irq_data(unsigned int irq, void 
 EXPORT_SYMBOL(set_irq_data);
 
 /**
+ *	set_irq_msi - set irq type data for an irq
+ *	@irq:	Interrupt number
+ *	@entry:	Pointer to MSI descriptor data
+ *
+ *	Set the hardware irq controller data for an irq
+ */
+int set_irq_msi(unsigned int irq, struct msi_desc *entry)
+{
+	struct irq_desc *desc;
+	unsigned long flags;
+
+	if (irq >= NR_IRQS) {
+		printk(KERN_ERR
+		       "Trying to install msi data for IRQ%d\n", irq);
+		return -EINVAL;
+	}
+	desc = irq_desc + irq;
+	spin_lock_irqsave(&desc->lock, flags);
+	desc->msi_desc = entry;
+	spin_unlock_irqrestore(&desc->lock, flags);
+	return 0;
+}
+
+/**
  *	set_irq_chip_data - set irq chip data for an irq
  *	@irq:	Interrupt number
  *	@data:	Pointer to chip specific data


Patches currently in gregkh-2.6 which might be from greg@kroah.com are

^ permalink raw reply	[flat|nested] 178+ messages in thread

* patch msi-make-msi-useable-more-architectures.patch added to gregkh-2.6 tree
  2007-01-28 19:56               ` Eric W. Biederman
  (?)
@ 2007-02-01  6:08               ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:08 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Make MSI useable more architectures

to my gregkh-2.6 tree.  Its filename is

     msi-make-msi-useable-more-architectures.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Wed Jan 31 22:01:54 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:56:37 -0700
Subject: msi: Make MSI useable more architectures
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m1r6tforgq.fsf_-_@ebiederm.dsl.xmission.com>

The arch hooks arch_setup_msi_irq and arch_teardown_msi_irq are now
responsible for allocating and freeing the linux irq in addition to
setting up the linux irq to work with the interrupt.

arch_setup_msi_irq now takes a pci_dev and a msi_desc and returns
an irq.

With this change in place this code should be useable by all platforms
except those that won't let the OS touch the hardware like ppc RTAS.
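
The new contract for the hooks can be sketched in userspace C.  This is
an illustration under stated assumptions: create_irq() and destroy_irq()
are toy allocators written for this example, the structs are stand-ins,
and the compose_fails flag is a hypothetical way to model a failure in
msi_compose_msg().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_IRQS 8

struct pci_dev  { bool compose_fails; };  /* flag forces the error path */
struct msi_desc { struct pci_dev *dev; };

static bool irq_in_use[NR_IRQS];
static struct msi_desc *irq_msi[NR_IRQS];

/* Toy allocator: hand out the lowest free irq, starting at 1. */
static int create_irq(void)
{
	for (int irq = 1; irq < NR_IRQS; irq++) {
		if (!irq_in_use[irq]) {
			irq_in_use[irq] = true;
			return irq;
		}
	}
	return -1;
}

static void destroy_irq(int irq)
{
	assert(irq > 0 && irq < NR_IRQS);
	irq_in_use[irq] = false;
	irq_msi[irq] = NULL;
}

/* The hook allocates the irq itself and returns it (negative on error). */
static int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
{
	int irq = create_irq();

	if (irq < 0)
		return irq;

	irq_msi[irq] = desc;
	if (dev->compose_fails) {       /* models msi_compose_msg() failing */
		destroy_irq(irq);       /* do not leak the irq on error */
		return -1;
	}
	return irq;
}

/* The matching hook frees what the setup hook allocated. */
static void arch_teardown_msi_irq(int irq)
{
	destroy_irq(irq);
}
```

Note that irq is a signed int here, so the "irq < 0" error check can
actually fire.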

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 arch/i386/kernel/io_apic.c   |   17 ++++++---
 arch/ia64/kernel/msi_ia64.c  |   19 ++++++----
 arch/ia64/sn/kernel/msi_sn.c |   20 +++++++---
 arch/x86_64/kernel/io_apic.c |   17 ++++++---
 drivers/pci/msi.c            |   80 +++++++++++--------------------------------
 include/asm-ia64/machvec.h   |    3 +
 include/linux/msi.h          |    2 -
 7 files changed, 75 insertions(+), 83 deletions(-)

--- gregkh-2.6.orig/arch/i386/kernel/io_apic.c
+++ gregkh-2.6/arch/i386/kernel/io_apic.c
@@ -2606,25 +2606,32 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq,
 				      "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
--- gregkh-2.6.orig/arch/ia64/kernel/msi_ia64.c
+++ gregkh-2.6/arch/ia64/kernel/msi_ia64.c
@@ -64,12 +64,17 @@ static void ia64_set_msi_irq_affinity(un
 }
 #endif /* CONFIG_SMP */
 
-int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int ia64_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	struct msi_msg	msg;
 	unsigned long	dest_phys_id;
-	unsigned int	vector;
+	unsigned int	irq, vector;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	dest_phys_id = cpu_physical_id(first_cpu(cpu_online_map));
 	vector = irq;
 
@@ -89,12 +94,12 @@ int ia64_setup_msi_irq(unsigned int irq,
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &ia64_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 void ia64_teardown_msi_irq(unsigned int irq)
 {
-	return;		/* no-op */
+	destroy_irq(irq);
 }
 
 static void ia64_ack_msi_irq(unsigned int irq)
@@ -126,12 +131,12 @@ static struct irq_chip ia64_msi_chip = {
 };
 
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	if (platform_setup_msi_irq)
-		return platform_setup_msi_irq(irq, pdev);
+		return platform_setup_msi_irq(pdev, desc);
 
-	return ia64_setup_msi_irq(irq, pdev);
+	return ia64_setup_msi_irq(pdev, desc);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
--- gregkh-2.6.orig/arch/ia64/sn/kernel/msi_sn.c
+++ gregkh-2.6/arch/ia64/sn/kernel/msi_sn.c
@@ -59,13 +59,12 @@ void sn_teardown_msi_irq(unsigned int ir
 	sn_intr_free(nasid, widget, sn_irq_info);
 	sn_msi_info[irq].sn_irq_info = NULL;
 
-	return;
+	destroy_irq(irq);
 }
 
-int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int sn_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *entry)
 {
 	struct msi_msg msg;
-	struct msi_desc *entry;
 	int widget;
 	int status;
 	nasid_t nasid;
@@ -73,8 +72,8 @@ int sn_setup_msi_irq(unsigned int irq, s
 	struct sn_irq_info *sn_irq_info;
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+	int irq;
 
-	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
@@ -84,6 +83,11 @@ int sn_setup_msi_irq(unsigned int irq, s
 	if (provider == NULL || provider->dma_map_consistent == NULL)
 		return -EINVAL;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, entry);
 	/*
 	 * Set up the vector plumbing.  Let the prom (via sn_intr_alloc)
 	 * decide which cpu to direct this msi at by default.
@@ -95,12 +99,15 @@ int sn_setup_msi_irq(unsigned int irq, s
 			SWIN_WIDGETNUM(bussoft->bs_base);
 
 	sn_irq_info = kzalloc(sizeof(struct sn_irq_info), GFP_KERNEL);
-	if (! sn_irq_info)
+	if (! sn_irq_info) {
+		destroy_irq(irq);
 		return -ENOMEM;
+	}
 
 	status = sn_intr_alloc(nasid, widget, sn_irq_info, irq, -1, -1);
 	if (status) {
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -121,6 +128,7 @@ int sn_setup_msi_irq(unsigned int irq, s
 	if (! bus_addr) {
 		sn_intr_free(nasid, widget, sn_irq_info);
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -139,7 +147,7 @@ int sn_setup_msi_irq(unsigned int irq, s
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &sn_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 #ifdef CONFIG_SMP
--- gregkh-2.6.orig/arch/x86_64/kernel/io_apic.c
+++ gregkh-2.6/arch/x86_64/kernel/io_apic.c
@@ -1956,24 +1956,31 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -192,37 +192,6 @@ static struct msi_desc* alloc_msi_entry(
 	return entry;
 }
 
-static int create_msi_irq(void)
-{
-	struct msi_desc *entry;
-	int irq;
-
-	entry = alloc_msi_entry();
-	if (!entry)
-		return -ENOMEM;
-
-	irq = create_irq();
-	if (irq < 0) {
-		kmem_cache_free(msi_cachep, entry);
-		return -EBUSY;
-	}
-
-	set_irq_msi(irq, entry);
-
-	return irq;
-}
-
-static void destroy_msi_irq(unsigned int irq)
-{
-	struct msi_desc *entry;
-
-	entry = get_irq_msi(irq);
-	set_irq_chip(irq, NULL);
-	set_irq_msi(irq, NULL);
-	destroy_irq(irq);
-	kmem_cache_free(msi_cachep, entry);
-}
-
 static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
 {
 	u16 control;
@@ -438,7 +407,6 @@ void pci_restore_msi_state(struct pci_de
  **/
 static int msi_capability_init(struct pci_dev *dev)
 {
-	int status;
 	struct msi_desc *entry;
 	int pos, irq;
 	u16 control;
@@ -446,13 +414,10 @@ static int msi_capability_init(struct pc
    	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	pci_read_config_word(dev, msi_control_reg(pos), &control);
 	/* MSI Entry Initialization */
-	irq = create_msi_irq();
-	if (irq < 0)
-		return irq;
+	entry = alloc_msi_entry();
+	if (!entry)
+		return -ENOMEM;
 
-	entry = get_irq_msi(irq);
-	entry->link.head = irq;
-	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
 	entry->msi_attrib.is_64 = is_64bit_address(control);
 	entry->msi_attrib.entry_nr = 0;
@@ -478,14 +443,16 @@ static int msi_capability_init(struct pc
 			maskbits);
 	}
 	/* Configure MSI capability structure */
-	status = arch_setup_msi_irq(irq, dev);
-	if (status < 0) {
-		destroy_msi_irq(irq);
-		return status;
+	irq = arch_setup_msi_irq(dev, entry);
+	if (irq < 0) {
+		kmem_cache_free(msi_cachep, entry);
+		return irq;
 	}
-
+	entry->link.head = irq;
+	entry->link.tail = irq;
 	dev->first_msi_irq = irq;
 	set_irq_msi(irq, entry);
+
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -507,7 +474,6 @@ static int msix_capability_init(struct p
 				struct msix_entry *entries, int nvec)
 {
 	struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
-	int status;
 	int irq, pos, i, j, nr_entries, temp = 0;
 	unsigned long phys_addr;
 	u32 table_offset;
@@ -530,13 +496,11 @@ static int msix_capability_init(struct p
 
 	/* MSI-X Table Initialization */
 	for (i = 0; i < nvec; i++) {
-		irq = create_msi_irq();
-		if (irq < 0)
+		entry = alloc_msi_entry();
+		if (!entry)
 			break;
 
-		entry = get_irq_msi(irq);
  		j = entries[i].entry;
- 		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
 		entry->msi_attrib.is_64 = 1;
 		entry->msi_attrib.entry_nr = j;
@@ -545,6 +509,14 @@ static int msix_capability_init(struct p
 		entry->msi_attrib.pos = pos;
 		entry->dev = dev;
 		entry->mask_base = base;
+
+		/* Configure MSI-X capability structure */
+		irq = arch_setup_msi_irq(dev, entry);
+		if (irq < 0) {
+			kmem_cache_free(msi_cachep, entry);
+			break;
+		}
+ 		entries[i].vector = irq;
 		if (!head) {
 			entry->link.head = irq;
 			entry->link.tail = irq;
@@ -557,12 +529,6 @@ static int msix_capability_init(struct p
 		}
 		temp = irq;
 		tail = entry;
-		/* Configure MSI-X capability structure */
-		status = arch_setup_msi_irq(irq, dev);
-		if (status < 0) {
-			destroy_msi_irq(irq);
-			break;
-		}
 
 		set_irq_msi(irq, entry);
 	}
@@ -706,8 +672,6 @@ static int msi_free_irq(struct pci_dev* 
 	int head, entry_nr, type;
 	void __iomem *base;
 
-	arch_teardown_msi_irq(irq);
-
 	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
@@ -718,9 +682,9 @@ static int msi_free_irq(struct pci_dev* 
 	base = entry->mask_base;
 	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
 	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
-	entry->dev = NULL;
 
-	destroy_msi_irq(irq);
+	arch_teardown_msi_irq(irq);
+	kmem_cache_free(msi_cachep, entry);
 
 	if (type == PCI_CAP_ID_MSIX) {
 		writel(1, base + entry_nr * PCI_MSIX_ENTRY_SIZE +
--- gregkh-2.6.orig/include/asm-ia64/machvec.h
+++ gregkh-2.6/include/asm-ia64/machvec.h
@@ -21,6 +21,7 @@ struct mm_struct;
 struct pci_bus;
 struct task_struct;
 struct pci_dev;
+struct msi_desc;
 
 typedef void ia64_mv_setup_t (char **);
 typedef void ia64_mv_cpu_init_t (void);
@@ -79,7 +80,7 @@ typedef unsigned short ia64_mv_readw_rel
 typedef unsigned int ia64_mv_readl_relaxed_t (const volatile void __iomem *);
 typedef unsigned long ia64_mv_readq_relaxed_t (const volatile void __iomem *);
 
-typedef int ia64_mv_setup_msi_irq_t (unsigned int irq, struct pci_dev *pdev);
+typedef int ia64_mv_setup_msi_irq_t (struct pci_dev *pdev, struct msi_desc *);
 typedef void ia64_mv_teardown_msi_irq_t (unsigned int irq);
 
 static inline void
--- gregkh-2.6.orig/include/linux/msi.h
+++ gregkh-2.6/include/linux/msi.h
@@ -41,7 +41,7 @@ struct msi_desc {
 /*
  * The arch hook for setup up msi irqs
  */
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev);
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
 
 


Patches currently in gregkh-2.6 which might be from greg@kroah.com are


* patch msi-remove-attach_msi_entry.patch added to gregkh-2.6 tree
  2007-01-28 19:47           ` Eric W. Biederman
  (?)
  (?)
@ 2007-02-01  6:08           ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:08 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Remove attach_msi_entry.

to my gregkh-2.6 tree.  Its filename is

     msi-remove-attach_msi_entry.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Wed Jan 31 22:00:56 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:47:52 -0700
Subject: msi: Remove attach_msi_entry.
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m1zm83orvb.fsf_-_@ebiederm.dsl.xmission.com>

attach_msi_entry has been reduced to a single simple assignment,
so for simplicity remove the abstraction and perform the assignment
directly.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c |    9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -193,11 +193,6 @@ static struct msi_desc* alloc_msi_entry(
 	return entry;
 }
 
-static void attach_msi_entry(struct msi_desc *entry, int irq)
-{
-	msi_desc[irq] = entry;
-}
-
 static int create_msi_irq(void)
 {
 	struct msi_desc *entry;
@@ -491,7 +486,7 @@ static int msi_capability_init(struct pc
 	}
 
 	dev->first_msi_irq = irq;
-	attach_msi_entry(entry, irq);
+	msi_desc[irq] = entry;
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -570,7 +565,7 @@ static int msix_capability_init(struct p
 			break;
 		}
 
-		attach_msi_entry(entry, irq);
+		msi_desc[irq] = entry;
 	}
 	if (i != nvec) {
 		int avail = i - 1;


Patches currently in gregkh-2.6 which might be from greg@kroah.com are


* patch msi-remove-msi_lock.patch added to gregkh-2.6 tree
  2007-01-28 19:44       ` Eric W. Biederman
  (?)
  (?)
@ 2007-02-01  6:08       ` gregkh
  -1 siblings, 0 replies; 178+ messages in thread
From: gregkh @ 2007-02-01  6:08 UTC (permalink / raw)
  To: greg, brice, davem, ebiederm, gregkh, grundler, kyle,
	linuxppc-dev, michael, mingo, shaohua.li, tony.luck


This is a note to let you know that I've just added the patch titled

     Subject: msi: Remove msi_lock.

to my gregkh-2.6 tree.  Its filename is

     msi-remove-msi_lock.patch

This tree can be found at 
    http://www.kernel.org/pub/linux/kernel/people/gregkh/gregkh-2.6/patches/


>From owner-linux-pci@atrey.karlin.mff.cuni.cz  Wed Jan 31 21:59:53 2007
From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 12:44:21 -0700
Subject: msi: Remove msi_lock.
To: Greg Kroah-Hartman <greg@kroah.com>
Cc: "David S. Miller" <davem@davemloft.net>, Kyle McMartin <kyle@parisc-linux.org>, <linuxppc-dev@ozlabs.org>, Brice Goglin <brice@myri.com>, <shaohua.li@intel.com>, Michael Ellerman <michael@ellerman.id.au>, Grant Grundler <grundler@parisc-linux.org>, Tony Luck <tony.luck@intel.com>, Ingo Molnar <mingo@elte.hu>
Message-ID: <m18xfnq6lm.fsf_-_@ebiederm.dsl.xmission.com>



With the removal of msi_lookup_irq, all of the functions using msi_lock
operate on a single device, and none of them can reasonably be
called on that device at the same time.

Since what little synchronization is needed has to happen outside of
the msi functions, msi_lock can never be contended; as such it is
useless and just complicates the code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>

---
 drivers/pci/msi.c |   20 --------------------
 1 file changed, 20 deletions(-)

--- gregkh-2.6.orig/drivers/pci/msi.c
+++ gregkh-2.6/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static DEFINE_SPINLOCK(msi_lock);
 static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
@@ -196,11 +195,7 @@ static struct msi_desc* alloc_msi_entry(
 
 static void attach_msi_entry(struct msi_desc *entry, int irq)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
 	msi_desc[irq] = entry;
-	spin_unlock_irqrestore(&msi_lock, flags);
 }
 
 static int create_msi_irq(void)
@@ -672,7 +667,6 @@ void pci_disable_msi(struct pci_dev* dev
 	struct msi_desc *entry;
 	int pos, default_irq;
 	u16 control;
-	unsigned long flags;
 
 	if (!pci_msi_enable)
 		return;
@@ -693,21 +687,17 @@ void pci_disable_msi(struct pci_dev* dev
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
 	if (irq_has_action(dev->first_msi_irq)) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
 		       pci_name(dev), dev->first_msi_irq);
 		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
@@ -721,14 +711,11 @@ static int msi_free_irq(struct pci_dev* 
 	struct msi_desc *entry;
 	int head, entry_nr, type;
 	void __iomem *base;
-	unsigned long flags;
 
 	arch_teardown_msi_irq(irq);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[irq];
 	if (!entry || entry->dev != dev) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return -EINVAL;
 	}
 	type = entry->msi_attrib.type;
@@ -739,7 +726,6 @@ static int msi_free_irq(struct pci_dev* 
 	msi_desc[entry->link.tail]->link.head = entry->link.head;
 	entry->dev = NULL;
 	msi_desc[irq] = NULL;
-	spin_unlock_irqrestore(&msi_lock, flags);
 
 	destroy_msi_irq(irq);
 
@@ -817,7 +803,6 @@ int pci_enable_msix(struct pci_dev* dev,
 void pci_disable_msix(struct pci_dev* dev)
 {
 	int irq, head, tail = 0, warning = 0;
-	unsigned long flags;
 	int pos;
 	u16 control;
 
@@ -841,9 +826,7 @@ void pci_disable_msix(struct pci_dev* de
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		spin_lock_irqsave(&msi_lock, flags);
 		tail = msi_desc[irq]->link.tail;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -872,7 +855,6 @@ void pci_disable_msix(struct pci_dev* de
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
 	int pos;
-	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
@@ -894,10 +876,8 @@ void msi_remove_pci_irq_vectors(struct p
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
 			base = msi_desc[irq]->mask_base;
-			spin_unlock_irqrestore(&msi_lock, flags);
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */


Patches currently in gregkh-2.6 which might be from greg@kroah.com are


end of thread, other threads:[~2007-02-01  6:09 UTC | newest]

Thread overview: 178+ messages
2007-01-25  8:34 [RFC/PATCH 0/16] Ops based MSI Implementation Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 1/16] Replace pci_msi_quirk with calls to pci_no_msi() Michael Ellerman
2007-01-25 22:33   ` patch msi-replace-pci_msi_quirk-with-calls-to-pci_no_msi.patch added to gregkh-2.6 tree gregkh
2007-01-25  8:34 ` [RFC/PATCH 3/16] Combine pci_(save|restore)_msi/msix_state Michael Ellerman
2007-01-25 22:33   ` patch msi-combine-pci__msi-msix_state.patch added to gregkh-2.6 tree gregkh
2007-01-25  8:34 ` [RFC/PATCH 2/16] Remove pci_scan_msi_device() Michael Ellerman
2007-01-25 22:33   ` patch msi-remove-pci_scan_msi_device.patch added to gregkh-2.6 tree gregkh
2007-01-25  8:34 ` [RFC/PATCH 5/16] Ops based MSI implementation Michael Ellerman
2007-01-25 21:52   ` Greg KH
2007-01-25 22:05     ` Roland Dreier
2007-01-25 22:10       ` Greg KH
2007-01-26  1:02     ` Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 4/16] Abstract MSI suspend Michael Ellerman
2007-01-25 22:33   ` patch msi-abstract-msi-suspend.patch added to gregkh-2.6 tree gregkh
2007-01-28  8:27   ` [RFC/PATCH 4/16] Abstract MSI suspend Eric W. Biederman
2007-01-29  7:22     ` Michael Ellerman
2007-01-29  8:45       ` Eric W. Biederman
2007-01-29  9:47         ` Michael Ellerman
2007-01-29 16:52           ` Grant Grundler
2007-01-29 16:57             ` Roland Dreier
2007-01-29 17:02               ` Roland Dreier
2007-01-29 17:25                 ` Eric W. Biederman
2007-01-29 17:32                   ` Roland Dreier
2007-01-29 22:03               ` Grant Grundler
2007-01-29 17:20           ` Eric W. Biederman
2007-02-01  4:24       ` Greg KH
2007-01-25  8:34 ` [RFC/PATCH 6/16] Add bare metal MSI enable & disable routines Michael Ellerman
2007-01-26  5:35   ` Eric W. Biederman
2007-01-25  8:34 ` [RFC/PATCH 7/16] Rip out the existing powerpc msi stubs Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 9/16] RTAS MSI implementation Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 8/16] Enable MSI on Powerpc Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 10/16] Add a pci_irq_fixup for MSI via RTAS Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 11/16] Activate MSI via RTAS on pseries Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 12/16] Tell firmware we support MSI Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 13/16] MPIC MSI allocator Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 14/16] MPIC MSI backend Michael Ellerman
2007-01-26  6:43   ` Grant Grundler
2007-01-26  7:02     ` Eric W. Biederman
2007-01-26  8:47       ` Segher Boessenkool
2007-01-26 16:32         ` Eric W. Biederman
2007-01-26 17:19           ` Grant Grundler
2007-01-26 17:56             ` Eric W. Biederman
2007-01-26 22:48               ` Benjamin Herrenschmidt
2007-01-27  7:01               ` Michael Ellerman
2007-01-26 22:40             ` Benjamin Herrenschmidt
2007-01-27  2:11               ` David Miller
2007-01-26 22:08           ` Benjamin Herrenschmidt
2007-01-27  6:54             ` Michael Ellerman
2007-01-26 20:50       ` Benjamin Herrenschmidt
2007-01-26 22:46       ` Paul Mackerras
2007-01-27  2:46         ` Eric W. Biederman
2007-01-27  3:02           ` David Miller
2007-01-27  4:28             ` Eric W. Biederman
2007-01-27 18:30         ` Grant Grundler
2007-01-27 20:02           ` Benjamin Herrenschmidt
2007-01-26 20:41     ` Benjamin Herrenschmidt
2007-01-26  9:11   ` Segher Boessenkool
2007-01-27  6:33     ` Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 15/16] Enable MSI mappings for MPIC Michael Ellerman
2007-01-25  8:34 ` [RFC/PATCH 16/16] Activate MSI for the MPIC backend on U3 Michael Ellerman
2007-01-25 21:53 ` [RFC/PATCH 0/16] Ops based MSI Implementation Greg KH
2007-01-25 21:55   ` David Miller
2007-01-26  1:05     ` Michael Ellerman
2007-01-26  1:03   ` Michael Ellerman
2007-01-26  6:18 ` Eric W. Biederman
2007-01-26  6:56   ` Grant Grundler
2007-01-26  7:15     ` Eric W. Biederman
2007-01-26  7:48       ` Grant Grundler
2007-01-26 15:26         ` Eric W. Biederman
2007-01-26 21:58         ` Benjamin Herrenschmidt
2007-01-26  8:57     ` Segher Boessenkool
2007-01-26 17:27       ` Grant Grundler
2007-01-26 20:57     ` Benjamin Herrenschmidt
2007-01-26 21:24   ` Benjamin Herrenschmidt
2007-01-27  5:41   ` Michael Ellerman
2007-01-28  6:16     ` Eric W. Biederman
2007-01-28  8:12       ` Michael Ellerman
2007-01-28  8:36         ` Eric W. Biederman
2007-01-28 20:14           ` Benjamin Herrenschmidt
2007-01-28 20:53             ` Eric W. Biederman
2007-01-28 21:17               ` Benjamin Herrenschmidt
2007-01-28 22:36                 ` Eric W. Biederman
2007-01-28 23:17                   ` Benjamin Herrenschmidt
2007-01-28 23:38                     ` Eric W. Biederman
2007-01-28 23:51                       ` David Miller
2007-01-29  0:58                         ` Benjamin Herrenschmidt
2007-01-29  1:13                           ` David Miller
2007-01-29  3:17                             ` Benjamin Herrenschmidt
2007-01-29  4:19                               ` David Miller
2007-01-29  4:44                                 ` Benjamin Herrenschmidt
2007-01-29  5:46                             ` Eric W. Biederman
2007-01-29  6:08                               ` Benjamin Herrenschmidt
2007-01-31  6:52                           ` David Miller
2007-01-31  7:40                             ` Eric W. Biederman
2007-02-01  0:55                               ` David Miller
2007-01-29  0:26                       ` Benjamin Herrenschmidt
2007-01-29  0:59                       ` Michael Ellerman
2007-01-28 23:31                   ` David Miller
2007-01-28 23:59                     ` Benjamin Herrenschmidt
2007-01-28 23:26               ` David Miller
2007-01-28 23:25             ` David Miller
2007-01-27  4:59 ` Michael Ellerman
2007-01-28 19:40 ` [PATCH 0/6] MSI portability cleanups Eric W. Biederman
2007-01-28 19:40   ` Eric W. Biederman
2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
2007-01-28 19:42     ` Eric W. Biederman
2007-01-28 19:44     ` [PATCH 2/6] msi: Remove msi_lock Eric W. Biederman
2007-01-28 19:44       ` Eric W. Biederman
2007-01-28 19:45       ` [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors Eric W. Biederman
2007-01-28 19:45         ` Eric W. Biederman
2007-01-28 19:47         ` [PATCH 4/6] msi: Remove attach_msi_entry Eric W. Biederman
2007-01-28 19:47           ` Eric W. Biederman
2007-01-28 19:52           ` [PATCH 5/6] msi: Kill the msi_desc array Eric W. Biederman
2007-01-28 19:52             ` Eric W. Biederman
2007-01-28 19:56             ` [PATCH 6/6] msi: Make MSI useable more architectures Eric W. Biederman
2007-01-28 19:56               ` Eric W. Biederman
2007-02-01  6:08               ` patch msi-make-msi-useable-more-architectures.patch added to gregkh-2.6 tree gregkh
2007-02-01  6:07             ` patch msi-kill-the-msi_desc-array.patch " gregkh
2007-02-01  6:08           ` patch msi-remove-attach_msi_entry.patch " gregkh
2007-02-01  6:07         ` patch msi-fix-msi_remove_pci_irq_vectors.patch " gregkh
2007-02-01  6:08       ` patch msi-remove-msi_lock.patch " gregkh
2007-01-28 22:01     ` [PATCH 1/6] msi: Kill msi_lookup_irq Paul Mackerras
2007-01-28 22:01       ` Paul Mackerras
2007-01-28 22:18       ` Eric W. Biederman
2007-01-28 22:18         ` Eric W. Biederman
2007-02-01  6:07     ` patch msi-kill-msi_lookup_irq.patch added to gregkh-2.6 tree gregkh
2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
2007-01-28 20:23     ` Benjamin Herrenschmidt
2007-01-28 20:47     ` Jeff Garzik
2007-01-28 20:47       ` Jeff Garzik
2007-01-28 21:20       ` Eric W. Biederman
2007-01-28 21:20         ` Eric W. Biederman
2007-01-28 21:26         ` Ingo Molnar
2007-01-28 21:26           ` Ingo Molnar
2007-01-28 22:09         ` Benjamin Herrenschmidt
2007-01-28 22:09           ` Benjamin Herrenschmidt
2007-01-28 23:26           ` Eric W. Biederman
2007-01-28 23:26             ` Eric W. Biederman
2007-01-28 23:37             ` David Miller
2007-01-28 23:37               ` David Miller
2007-01-29  5:18               ` Eric W. Biederman
2007-01-29  5:18                 ` Eric W. Biederman
2007-01-29  5:25                 ` David Miller
2007-01-29  5:25                   ` David Miller
2007-01-29  5:58                   ` Eric W. Biederman
2007-01-29  5:58                     ` Eric W. Biederman
2007-01-29  6:05                   ` Benjamin Herrenschmidt
2007-01-29  6:05                     ` Benjamin Herrenschmidt
2007-01-29  8:28                     ` Eric W. Biederman
2007-01-29  8:28                       ` Eric W. Biederman
2007-01-29  9:03                     ` Eric W. Biederman
2007-01-29  9:03                       ` Eric W. Biederman
2007-01-29 10:11                       ` Michael Ellerman
2007-01-29 10:11                         ` Michael Ellerman
2007-01-29 20:32                         ` Benjamin Herrenschmidt
2007-01-29 20:32                           ` Benjamin Herrenschmidt
2007-01-29 23:29                         ` Paul Mackerras
2007-01-29 23:29                           ` Paul Mackerras
2007-01-29 23:40                           ` Benjamin Herrenschmidt
2007-01-29 23:40                             ` Benjamin Herrenschmidt
2007-01-29 20:22                       ` Benjamin Herrenschmidt
2007-01-29 20:22                         ` Benjamin Herrenschmidt
2007-01-29 23:05                         ` Paul Mackerras
2007-01-29 23:05                           ` Paul Mackerras
2007-01-30 19:32                           ` Segher Boessenkool
2007-01-30 19:32                             ` Segher Boessenkool
2007-01-29  1:33             ` Benjamin Herrenschmidt
2007-01-29  1:33               ` Benjamin Herrenschmidt
2007-02-01  4:29           ` Greg KH
2007-02-01  4:29             ` Greg KH
2007-01-28 23:44         ` David Miller
2007-01-28 23:44           ` David Miller
2007-01-28 22:11       ` Eric W. Biederman
2007-01-28 22:11         ` Eric W. Biederman
2007-01-28 23:42       ` David Miller
2007-01-28 23:42         ` David Miller
2007-01-28 21:34     ` Eric W. Biederman
2007-01-28 21:34       ` Eric W. Biederman
