linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/6] MSI portability cleanups
       [not found] <1169714047.65693.647693675533.qpush@cradle>
@ 2007-01-28 19:40 ` Eric W. Biederman
  2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
  2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
  0 siblings, 2 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:40 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


This patchset is against gregkh-pci but except for the context around
msi_lookup_irq being completely different it applies cleanly to 2.6.20-rc6
as well.

When I first looked at this problem I thought no big deal it will one
or two simple patches and that is it.

When I looked more closely I discovered that to be certain of not introducing
bugs I would have to kill msi_lock, which made the problem a little more
difficult.

The result of this patchset is that architecture hooks become:

int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
void arch_teardown_msi_irq(unsigned int irq);

and are responsible for allocating and freeing the irq as well
as setting it up.

This touches the architecture code for i386, x86_64, and ia64 to
accomplish this.

Since I couldn't test ia64 I reviewed the code closely, and compile
tested it.

The other big change is that I added a field to irq_desc to point
at the msi_desc.  This removes the conflicts with the existing pointer
fields and makes the irq -> msi_desc mapping useable outside of msi.c

The only architecture problem that isn't solvable in this context is
the problem of supporting the crazy hypervisor on the ppc RTAS, which
asks us to drive the hardware but does not give us access to the
hardware registers.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 19:40 ` [PATCH 0/6] MSI portability cleanups Eric W. Biederman
@ 2007-01-28 19:42   ` Eric W. Biederman
  2007-01-28 19:44     ` [PATCH 2/6] msi: Remove msi_lock Eric W. Biederman
  2007-01-28 22:01     ` [PATCH 1/6] msi: Kill msi_lookup_irq Paul Mackerras
  2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
  1 sibling, 2 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:42 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The function msi_lookup_irq was horrible.  As a side effect of running
it changed dev->irq, and then the callers would need to change it
back.  In addition it does a global scan through all of the irqs,
which seems to be the sole justification of the msi_lock.

To remove the neede for msi_lookup_irq I added first_msi_irq to struct
pci_dev.  Then depending on the context I replaced msi_lookup_irq with
dev->first_msi_irq, dev->msi_enabled, or dev->msix_enabled.

msi_enabled and msix_enabled were already present in pci_dev for other
reasons.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c   |  149 ++++++++++++++++++++-------------------------------
 include/linux/pci.h |    3 +
 2 files changed, 62 insertions(+), 90 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index bca5a8a..71080c9 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -283,28 +283,6 @@ void pci_disable_device_msi(struct pci_dev *dev)
 			PCI_CAP_ID_MSIX);
 }
 
-static int msi_lookup_irq(struct pci_dev *dev, int type)
-{
-	int irq;
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
-	for (irq = 0; irq < NR_IRQS; irq++) {
-		if (!msi_desc[irq] || msi_desc[irq]->dev != dev ||
-			msi_desc[irq]->msi_attrib.type != type ||
-			msi_desc[irq]->msi_attrib.default_irq != dev->irq)
-			continue;
-		spin_unlock_irqrestore(&msi_lock, flags);
-		/* This pre-assigned MSI irq for this device
-		   already exists. Override dev->irq with this irq */
-		dev->irq = irq;
-		return 0;
-	}
-	spin_unlock_irqrestore(&msi_lock, flags);
-
-	return -EACCES;
-}
-
 #ifdef CONFIG_PM
 static int __pci_save_msi_state(struct pci_dev *dev)
 {
@@ -375,11 +353,13 @@ static void __pci_restore_msi_state(struct pci_dev *dev)
 static int __pci_save_msix_state(struct pci_dev *dev)
 {
 	int pos;
-	int temp;
 	int irq, head, tail = 0;
 	u16 control;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return 0;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (pos <= 0 || dev->no_msi)
 		return 0;
@@ -397,13 +377,7 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	*((u16 *)&save_state->data[0]) = control;
 
 	/* save the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		kfree(save_state);
-		return -EINVAL;
-	}
-
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		struct msi_desc *entry;
 
@@ -413,7 +387,6 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	save_state->cap_nr = PCI_CAP_ID_MSIX;
 	pci_add_saved_cap(dev, save_state);
@@ -439,9 +412,11 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	int pos;
 	int irq, head, tail = 0;
 	struct msi_desc *entry;
-	int temp;
 	struct pci_cap_saved_state *save_state;
 
+	if (!dev->msix_enabled)
+		return;
+
 	save_state = pci_find_saved_cap(dev, PCI_CAP_ID_MSIX);
 	if (!save_state)
 		return;
@@ -454,10 +429,7 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		return;
 
 	/* route the table */
-	temp = dev->irq;
-	if (msi_lookup_irq(dev, PCI_CAP_ID_MSIX))
-		return;
-	irq = head = dev->irq;
+	irq = head = dev->first_msi_irq;
 	while (head != tail) {
 		entry = msi_desc[irq];
 		write_msi_msg(irq, &entry->msg_save);
@@ -465,7 +437,6 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 		tail = msi_desc[irq]->link.tail;
 		irq = tail;
 	}
-	dev->irq = temp;
 
 	pci_write_config_word(dev, msi_control_reg(pos), save);
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
@@ -535,6 +506,7 @@ static int msi_capability_init(struct pci_dev *dev)
 		return status;
 	}
 
+	dev->first_msi_irq = irq;
 	attach_msi_entry(entry, irq);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
@@ -631,6 +603,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			avail = -EBUSY;
 		return avail;
 	}
+	dev->first_msi_irq = entries[0].vector;
 	/* Set MSI-X enabled bits */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
@@ -678,13 +651,11 @@ int pci_msi_supported(struct pci_dev * dev)
  **/
 int pci_enable_msi(struct pci_dev* dev)
 {
-	int pos, temp, status;
+	int pos, status;
 
 	if (pci_msi_supported(dev) < 0)
 		return -EINVAL;
 
-	temp = dev->irq;
-
 	status = msi_init();
 	if (status < 0)
 		return status;
@@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
 	if (!pos)
 		return -EINVAL;
 
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
+	WARN_ON(!!dev->msi_enabled);
 
 	/* Check whether driver already requested for MSI-X irqs */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 			printk(KERN_INFO "PCI: %s: Can't enable MSI.  "
-			       "Device already has MSI-X irq assigned\n",
+			       "Device already has MSI-X enabled\n",
 			       pci_name(dev));
-			dev->irq = temp;
 			return -EINVAL;
 	}
 	status = msi_capability_init(dev);
@@ -720,6 +690,9 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msi_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	if (!pos)
 		return;
@@ -728,28 +701,30 @@ void pci_disable_msi(struct pci_dev* dev)
 	if (!(control & PCI_MSI_FLAGS_ENABLE))
 		return;
 
+
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
 	spin_lock_irqsave(&msi_lock, flags);
-	entry = msi_desc[dev->irq];
+	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
-	if (irq_has_action(dev->irq)) {
+	if (irq_has_action(dev->first_msi_irq)) {
 		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
-		       pci_name(dev), dev->irq);
-		BUG_ON(irq_has_action(dev->irq));
+		       pci_name(dev), dev->first_msi_irq);
+		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
 		spin_unlock_irqrestore(&msi_lock, flags);
-		msi_free_irq(dev, dev->irq);
+		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
 		dev->irq = default_irq;
 	}
+	dev->first_msi_irq = 0;
 }
 
 static int msi_free_irq(struct pci_dev* dev, int irq)
@@ -808,7 +783,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 {
 	int status, pos, nr_entries;
-	int i, j, temp;
+	int i, j;
 	u16 control;
 
 	if (!entries || pci_msi_supported(dev) < 0)
@@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 				return -EINVAL;	/* duplicate entry */
 		}
 	}
-	temp = dev->irq;
-	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
+	WARN_ON(!!dev->msix_enabled);
 
 	/* Check whether driver already requested for MSI irq */
    	if (pci_find_capability(dev, PCI_CAP_ID_MSI) > 0 &&
-		!msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
+		dev->msi_enabled) {
 		printk(KERN_INFO "PCI: %s: Can't enable MSI-X.  "
 		       "Device already has an MSI irq assigned\n",
 		       pci_name(dev));
-		dev->irq = temp;
 		return -EINVAL;
 	}
 	status = msix_capability_init(dev, entries, nvec);
@@ -854,7 +827,9 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 
 void pci_disable_msix(struct pci_dev* dev)
 {
-	int pos, temp;
+	int irq, head, tail = 0, warning = 0;
+	unsigned long flags;
+	int pos;
 	u16 control;
 
 	if (!pci_msi_enable)
@@ -862,6 +837,9 @@ void pci_disable_msix(struct pci_dev* dev)
 	if (!dev)
 		return;
 
+	if (!dev->msix_enabled)
+		return;
+
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
 	if (!pos)
 		return;
@@ -872,31 +850,25 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSIX);
 
-	temp = dev->irq;
-	if (!msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
-		int irq, head, tail = 0, warning = 0;
-		unsigned long flags;
-
-		irq = head = dev->irq;
-		dev->irq = temp;			/* Restore pin IRQ */
-		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
-			tail = msi_desc[irq]->link.tail;
-			spin_unlock_irqrestore(&msi_lock, flags);
-			if (irq_has_action(irq))
-				warning = 1;
-			else if (irq != head)	/* Release MSI-X irq */
-				msi_free_irq(dev, irq);
-			irq = tail;
-		}
-		msi_free_irq(dev, irq);
-		if (warning) {
-			printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
-			       "free_irq() on all MSI-X irqs\n",
-			       pci_name(dev));
-			BUG_ON(warning > 0);
-		}
+	irq = head = dev->first_msi_irq;
+	while (head != tail) {
+		spin_lock_irqsave(&msi_lock, flags);
+		tail = msi_desc[irq]->link.tail;
+		spin_unlock_irqrestore(&msi_lock, flags);
+		if (irq_has_action(irq))
+			warning = 1;
+		else if (irq != head)	/* Release MSI-X irq */
+			msi_free_irq(dev, irq);
+		irq = tail;
+	}
+	msi_free_irq(dev, irq);
+	if (warning) {
+		printk(KERN_WARNING "PCI: %s: pci_disable_msix() called without "
+			"free_irq() on all MSI-X irqs\n",
+			pci_name(dev));
+		BUG_ON(warning > 0);
 	}
+	dev->first_msi_irq = 0;
 }
 
 /**
@@ -910,30 +882,28 @@ void pci_disable_msix(struct pci_dev* dev)
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos, temp;
+	int pos;
 	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
 
-	temp = dev->irq;		/* Save IOAPIC IRQ */
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSI)) {
-		if (irq_has_action(dev->irq)) {
+	if (pos > 0 && dev->msi_enabled) {
+		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
-			       pci_name(dev), dev->irq);
-			BUG_ON(irq_has_action(dev->irq));
+			       pci_name(dev), dev->first_msi_irq);
+			BUG_ON(irq_has_action(dev->first_msi_irq));
 		} else /* Release MSI irq assigned to this device */
-			msi_free_irq(dev, dev->irq);
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
+			msi_free_irq(dev, dev->first_msi_irq);
 	}
 	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && !msi_lookup_irq(dev, PCI_CAP_ID_MSIX)) {
+	if (pos > 0 && dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-		irq = head = dev->irq;
+		irq = head = dev->first_msi_irq;
 		while (head != tail) {
 			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
@@ -953,7 +923,6 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 			       pci_name(dev));
 			BUG_ON(warning > 0);
 		}
-		dev->irq = temp;		/* Restore IOAPIC IRQ */
 	}
 }
 
diff --git a/include/linux/pci.h b/include/linux/pci.h
index 29765e2..1507f8c 100644
--- a/include/linux/pci.h
+++ b/include/linux/pci.h
@@ -174,6 +174,9 @@ struct pci_dev {
 	struct bin_attribute *rom_attr; /* attribute descriptor for sysfs ROM entry */
 	int rom_attr_enabled;		/* has display of the rom attribute been enabled? */
 	struct bin_attribute *res_attr[DEVICE_COUNT_RESOURCE]; /* sysfs file for resources */
+#ifdef CONFIG_PCI_MSI
+	unsigned int first_msi_irq;
+#endif
 };
 
 #define pci_dev_g(n) list_entry(n, struct pci_dev, global_list)
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 2/6] msi: Remove msi_lock.
  2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
@ 2007-01-28 19:44     ` Eric W. Biederman
  2007-01-28 19:45       ` [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors Eric W. Biederman
  2007-01-28 22:01     ` [PATCH 1/6] msi: Kill msi_lookup_irq Paul Mackerras
  1 sibling, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:44 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


With the removal of msi_lookup_irq all of the functions using msi_lock
operated on a single device and none of them could reasonably be
called on that device at the same time. 

Since what little synchronization that needs to happen needs to happen
outside of the msi functions, msi_lock could never be contended and as
such is useless and just complicates the code.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |   20 --------------------
 1 files changed, 0 insertions(+), 20 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 71080c9..5e7a187 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static DEFINE_SPINLOCK(msi_lock);
 static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
@@ -196,11 +195,7 @@ static struct msi_desc* alloc_msi_entry(void)
 
 static void attach_msi_entry(struct msi_desc *entry, int irq)
 {
-	unsigned long flags;
-
-	spin_lock_irqsave(&msi_lock, flags);
 	msi_desc[irq] = entry;
-	spin_unlock_irqrestore(&msi_lock, flags);
 }
 
 static int create_msi_irq(void)
@@ -683,7 +678,6 @@ void pci_disable_msi(struct pci_dev* dev)
 	struct msi_desc *entry;
 	int pos, default_irq;
 	u16 control;
-	unsigned long flags;
 
 	if (!pci_msi_enable)
 		return;
@@ -704,21 +698,17 @@ void pci_disable_msi(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[dev->first_msi_irq];
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return;
 	}
 	if (irq_has_action(dev->first_msi_irq)) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		printk(KERN_WARNING "PCI: %s: pci_disable_msi() called without "
 		       "free_irq() on MSI irq %d\n",
 		       pci_name(dev), dev->first_msi_irq);
 		BUG_ON(irq_has_action(dev->first_msi_irq));
 	} else {
 		default_irq = entry->msi_attrib.default_irq;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		msi_free_irq(dev, dev->first_msi_irq);
 
 		/* Restore dev->irq to its default pin-assertion irq */
@@ -732,14 +722,11 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	struct msi_desc *entry;
 	int head, entry_nr, type;
 	void __iomem *base;
-	unsigned long flags;
 
 	arch_teardown_msi_irq(irq);
 
-	spin_lock_irqsave(&msi_lock, flags);
 	entry = msi_desc[irq];
 	if (!entry || entry->dev != dev) {
-		spin_unlock_irqrestore(&msi_lock, flags);
 		return -EINVAL;
 	}
 	type = entry->msi_attrib.type;
@@ -750,7 +737,6 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	msi_desc[entry->link.tail]->link.head = entry->link.head;
 	entry->dev = NULL;
 	msi_desc[irq] = NULL;
-	spin_unlock_irqrestore(&msi_lock, flags);
 
 	destroy_msi_irq(irq);
 
@@ -828,7 +814,6 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
 void pci_disable_msix(struct pci_dev* dev)
 {
 	int irq, head, tail = 0, warning = 0;
-	unsigned long flags;
 	int pos;
 	u16 control;
 
@@ -852,9 +837,7 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		spin_lock_irqsave(&msi_lock, flags);
 		tail = msi_desc[irq]->link.tail;
-		spin_unlock_irqrestore(&msi_lock, flags);
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -883,7 +866,6 @@ void pci_disable_msix(struct pci_dev* dev)
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
 	int pos;
-	unsigned long flags;
 
 	if (!pci_msi_enable || !dev)
  		return;
@@ -905,10 +887,8 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			spin_lock_irqsave(&msi_lock, flags);
 			tail = msi_desc[irq]->link.tail;
 			base = msi_desc[irq]->mask_base;
-			spin_unlock_irqrestore(&msi_lock, flags);
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors.
  2007-01-28 19:44     ` [PATCH 2/6] msi: Remove msi_lock Eric W. Biederman
@ 2007-01-28 19:45       ` Eric W. Biederman
  2007-01-28 19:47         ` [PATCH 4/6] msi: Remove attach_msi_entry Eric W. Biederman
  0 siblings, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:45 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


Since msi_remove_pci_irq_vectors is designed to be called during
hotplug remove it is actively wrong to query the hardware and expect
meaningful results back. 

To that end remove the pci_find_capability calls.  Testing
dev->msi_enabled and dev->msix_enabled gives us all of the information
we need.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |    8 ++------
 1 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 5e7a187..db9c1d7 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -865,13 +865,10 @@ void pci_disable_msix(struct pci_dev* dev)
  **/
 void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 {
-	int pos;
-
 	if (!pci_msi_enable || !dev)
  		return;
 
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
-	if (pos > 0 && dev->msi_enabled) {
+	if (dev->msi_enabled) {
 		if (irq_has_action(dev->first_msi_irq)) {
 			printk(KERN_WARNING "PCI: %s: msi_remove_pci_irq_vectors() "
 			       "called without free_irq() on MSI irq %d\n",
@@ -880,8 +877,7 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 		} else /* Release MSI irq assigned to this device */
 			msi_free_irq(dev, dev->first_msi_irq);
 	}
-	pos = pci_find_capability(dev, PCI_CAP_ID_MSIX);
-	if (pos > 0 && dev->msix_enabled) {
+	if (dev->msix_enabled) {
 		int irq, head, tail = 0, warning = 0;
 		void __iomem *base = NULL;
 
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 4/6] msi: Remove attach_msi_entry.
  2007-01-28 19:45       ` [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors Eric W. Biederman
@ 2007-01-28 19:47         ` Eric W. Biederman
  2007-01-28 19:52           ` [PATCH 5/6] msi: Kill the msi_desc array Eric W. Biederman
  0 siblings, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:47 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The attach_msi_entry has been reduced to a single simple assignment,
so for simplicity remove the abstraction and directory perform the
assignment.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 drivers/pci/msi.c |    9 ++-------
 1 files changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index db9c1d7..b994012 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -193,11 +193,6 @@ static struct msi_desc* alloc_msi_entry(void)
 	return entry;
 }
 
-static void attach_msi_entry(struct msi_desc *entry, int irq)
-{
-	msi_desc[irq] = entry;
-}
-
 static int create_msi_irq(void)
 {
 	struct msi_desc *entry;
@@ -502,7 +497,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	}
 
 	dev->first_msi_irq = irq;
-	attach_msi_entry(entry, irq);
+	msi_desc[irq] = entry;
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -581,7 +576,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			break;
 		}
 
-		attach_msi_entry(entry, irq);
+		msi_desc[irq] = entry;
 	}
 	if (i != nvec) {
 		int avail = i - 1;
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 5/6] msi: Kill the msi_desc array.
  2007-01-28 19:47         ` [PATCH 4/6] msi: Remove attach_msi_entry Eric W. Biederman
@ 2007-01-28 19:52           ` Eric W. Biederman
  2007-01-28 19:56             ` [PATCH 6/6] msi: Make MSI useable more architectures Eric W. Biederman
  0 siblings, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:52 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


We need to be able to get from an irq number to a struct msi_desc.
The msi_desc array in msi.c had several short comings the big one was
that it could not be used outside of msi.c.  Using irq_data in struct
irq_desc almost worked except on some architectures irq_data needs to
be used for something else. 

So this patch adds a msi_desc pointer to irq_desc, adds the appropriate
wrappers and changes all of the msi code to use them.

The dynamic_irq_init/cleanup code was tweaked to ensure the new
field is left in a well defined state.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/ia64/sn/kernel/msi_sn.c |    2 +-
 drivers/pci/msi.c            |   44 ++++++++++++++++++++---------------------
 include/linux/irq.h          |    4 +++
 kernel/irq/chip.c            |   28 ++++++++++++++++++++++++++
 4 files changed, 54 insertions(+), 24 deletions(-)

diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index b3a435f..31fbb85 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -74,7 +74,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index b994012..d7a2259 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -24,7 +24,6 @@
 #include "pci.h"
 #include "msi.h"
 
-static struct msi_desc* msi_desc[NR_IRQS] = { [0 ... NR_IRQS-1] = NULL };
 static struct kmem_cache* msi_cachep;
 
 static int pci_msi_enable = 1;
@@ -43,7 +42,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 {
 	struct msi_desc *entry;
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
@@ -73,7 +72,7 @@ static void msi_set_mask_bit(unsigned int irq, int flag)
 
 void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch(entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -112,7 +111,7 @@ void read_msi_msg(unsigned int irq, struct msi_msg *msg)
 
 void write_msi_msg(unsigned int irq, struct msi_msg *msg)
 {
-	struct msi_desc *entry = get_irq_data(irq);
+	struct msi_desc *entry = get_irq_msi(irq);
 	switch (entry->msi_attrib.type) {
 	case PCI_CAP_ID_MSI:
 	{
@@ -208,7 +207,7 @@ static int create_msi_irq(void)
 		return -EBUSY;
 	}
 
-	set_irq_data(irq, entry);
+	set_irq_msi(irq, entry);
 
 	return irq;
 }
@@ -217,9 +216,9 @@ static void destroy_msi_irq(unsigned int irq)
 {
 	struct msi_desc *entry;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	set_irq_chip(irq, NULL);
-	set_irq_data(irq, NULL);
+	set_irq_msi(irq, NULL);
 	destroy_irq(irq);
 	kmem_cache_free(msi_cachep, entry);
 }
@@ -371,10 +370,10 @@ static int __pci_save_msix_state(struct pci_dev *dev)
 	while (head != tail) {
 		struct msi_desc *entry;
 
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		read_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -421,10 +420,10 @@ static void __pci_restore_msix_state(struct pci_dev *dev)
 	/* route the table */
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		entry = msi_desc[irq];
+		entry = get_irq_msi(irq);
 		write_msi_msg(irq, &entry->msg_save);
 
-		tail = msi_desc[irq]->link.tail;
+		tail = entry->link.tail;
 		irq = tail;
 	}
 
@@ -462,7 +461,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	if (irq < 0)
 		return irq;
 
-	entry = get_irq_data(irq);
+	entry = get_irq_msi(irq);
 	entry->link.head = irq;
 	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
@@ -497,7 +496,7 @@ static int msi_capability_init(struct pci_dev *dev)
 	}
 
 	dev->first_msi_irq = irq;
-	msi_desc[irq] = entry;
+	set_irq_msi(irq, entry);
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -546,7 +545,7 @@ static int msix_capability_init(struct pci_dev *dev,
 		if (irq < 0)
 			break;
 
-		entry = get_irq_data(irq);
+		entry = get_irq_msi(irq);
  		j = entries[i].entry;
  		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
@@ -576,7 +575,7 @@ static int msix_capability_init(struct pci_dev *dev,
 			break;
 		}
 
-		msi_desc[irq] = entry;
+		set_irq_msi(irq, entry);
 	}
 	if (i != nvec) {
 		int avail = i - 1;
@@ -693,7 +692,7 @@ void pci_disable_msi(struct pci_dev* dev)
 
 	disable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
-	entry = msi_desc[dev->first_msi_irq];
+	entry = get_irq_msi(dev->first_msi_irq);
 	if (!entry || !entry->dev || entry->msi_attrib.type != PCI_CAP_ID_MSI) {
 		return;
 	}
@@ -720,7 +719,7 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 
 	arch_teardown_msi_irq(irq);
 
-	entry = msi_desc[irq];
+	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
 	}
@@ -728,10 +727,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	entry_nr = entry->msi_attrib.entry_nr;
 	head = entry->link.head;
 	base = entry->mask_base;
-	msi_desc[entry->link.head]->link.tail = entry->link.tail;
-	msi_desc[entry->link.tail]->link.head = entry->link.head;
+	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
+	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
 	entry->dev = NULL;
-	msi_desc[irq] = NULL;
 
 	destroy_msi_irq(irq);
 
@@ -832,7 +830,7 @@ void pci_disable_msix(struct pci_dev* dev)
 
 	irq = head = dev->first_msi_irq;
 	while (head != tail) {
-		tail = msi_desc[irq]->link.tail;
+		tail = get_irq_msi(irq)->link.tail;
 		if (irq_has_action(irq))
 			warning = 1;
 		else if (irq != head)	/* Release MSI-X irq */
@@ -878,8 +876,8 @@ void msi_remove_pci_irq_vectors(struct pci_dev* dev)
 
 		irq = head = dev->first_msi_irq;
 		while (head != tail) {
-			tail = msi_desc[irq]->link.tail;
-			base = msi_desc[irq]->mask_base;
+			tail = get_irq_msi(irq)->link.tail;
+			base = get_irq_msi(irq)->mask_base;
 			if (irq_has_action(irq))
 				warning = 1;
 			else if (irq != head) /* Release MSI-X irq */
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 52fc405..5504b67 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -68,6 +68,7 @@ typedef	void fastcall (*irq_flow_handler_t)(unsigned int irq,
 #define IRQ_MOVE_PENDING	0x40000000	/* need to re-target IRQ destination */
 
 struct proc_dir_entry;
+struct msi_desc;
 
 /**
  * struct irq_chip - hardware interrupt chip descriptor
@@ -148,6 +149,7 @@ struct irq_chip {
 struct irq_desc {
 	irq_flow_handler_t	handle_irq;
 	struct irq_chip		*chip;
+	struct msi_desc		*msi_desc;
 	void			*handler_data;
 	void			*chip_data;
 	struct irqaction	*action;	/* IRQ action list */
@@ -373,10 +375,12 @@ extern int set_irq_chip(unsigned int irq, struct irq_chip *chip);
 extern int set_irq_data(unsigned int irq, void *data);
 extern int set_irq_chip_data(unsigned int irq, void *data);
 extern int set_irq_type(unsigned int irq, unsigned int type);
+extern int set_irq_msi(unsigned int irq, struct msi_desc *entry);
 
 #define get_irq_chip(irq)	(irq_desc[irq].chip)
 #define get_irq_chip_data(irq)	(irq_desc[irq].chip_data)
 #define get_irq_data(irq)	(irq_desc[irq].handler_data)
+#define get_irq_msi(irq)	(irq_desc[irq].msi_desc)
 
 #endif /* CONFIG_GENERIC_HARDIRQS */
 
diff --git a/kernel/irq/chip.c b/kernel/irq/chip.c
index d27b258..475e8a7 100644
--- a/kernel/irq/chip.c
+++ b/kernel/irq/chip.c
@@ -39,6 +39,7 @@ void dynamic_irq_init(unsigned int irq)
 	desc->chip = &no_irq_chip;
 	desc->handle_irq = handle_bad_irq;
 	desc->depth = 1;
+	desc->msi_desc = NULL;
 	desc->handler_data = NULL;
 	desc->chip_data = NULL;
 	desc->action = NULL;
@@ -74,6 +75,9 @@ void dynamic_irq_cleanup(unsigned int irq)
 		WARN_ON(1);
 		return;
 	}
+	desc->msi_desc = NULL;
+	desc->handler_data = NULL;
+	desc->chip_data = NULL;
 	desc->handle_irq = handle_bad_irq;
 	desc->chip = &no_irq_chip;
 	spin_unlock_irqrestore(&desc->lock, flags);
@@ -162,6 +166,30 @@ int set_irq_data(unsigned int irq, void *data)
 EXPORT_SYMBOL(set_irq_data);
 
 /**
+ *	set_irq_data - set irq type data for an irq
+ *	@irq:	Interrupt number
+ *	@data:	Pointer to interrupt specific data
+ *
+ *	Set the hardware irq controller data for an irq
+ */
+int set_irq_msi(unsigned int irq, struct msi_desc *entry)
+{
+	struct irq_desc *desc;
+	unsigned long flags;
+
+	if (irq >= NR_IRQS) {
+		printk(KERN_ERR
+		       "Trying to install msi data for IRQ%d\n", irq);
+		return -EINVAL;
+	}
+	desc = irq_desc + irq;
+	spin_lock_irqsave(&desc->lock, flags);
+	desc->msi_desc = entry;
+	spin_unlock_irqrestore(&desc->lock, flags);
+	return 0;
+}
+
+/**
  *	set_irq_chip_data - set irq chip data for an irq
  *	@irq:	Interrupt number
  *	@data:	Pointer to chip specific data
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH 6/6] msi: Make MSI useable more architectures
  2007-01-28 19:52           ` [PATCH 5/6] msi: Kill the msi_desc array Eric W. Biederman
@ 2007-01-28 19:56             ` Eric W. Biederman
  0 siblings, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 19:56 UTC (permalink / raw)
  To: Greg Kroah-Hartman
  Cc: linux-pci, David S. Miller, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, Michael Ellerman, Grant Grundler,
	Tony Luck, linux-kernel, Ingo Molnar


The arch hooks arch_setup_msi_irq and arch_teardown_msi_irq are now
responsible for allocating and freeing the linux irq in addition to
setting up the the linux irq to work with the interrupt.

arch_setup_msi_irq now takes a pci_device and a msi_desc and returns
an irq.

With this change in place this code should be useable by all platforms
except those that won't let the OS touch the hardware like ppc RTAS.

Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
---
 arch/i386/kernel/io_apic.c   |   17 ++++++---
 arch/ia64/kernel/msi_ia64.c  |   19 ++++++----
 arch/ia64/sn/kernel/msi_sn.c |   20 +++++++---
 arch/x86_64/kernel/io_apic.c |   17 ++++++---
 drivers/pci/msi.c            |   80 +++++++++++------------------------------
 include/asm-ia64/machvec.h   |    3 +-
 include/linux/msi.h          |    2 +-
 7 files changed, 75 insertions(+), 83 deletions(-)

diff --git a/arch/i386/kernel/io_apic.c b/arch/i386/kernel/io_apic.c
index 2424cc9..9ba4f99 100644
--- a/arch/i386/kernel/io_apic.c
+++ b/arch/i386/kernel/io_apic.c
@@ -2600,25 +2600,32 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq,
 				      "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/arch/ia64/kernel/msi_ia64.c b/arch/ia64/kernel/msi_ia64.c
index 822e59a..0d05450 100644
--- a/arch/ia64/kernel/msi_ia64.c
+++ b/arch/ia64/kernel/msi_ia64.c
@@ -64,12 +64,17 @@ static void ia64_set_msi_irq_affinity(unsigned int irq, cpumask_t cpu_mask)
 }
 #endif /* CONFIG_SMP */
 
-int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int ia64_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	struct msi_msg	msg;
 	unsigned long	dest_phys_id;
-	unsigned int	vector;
+	unsigned int	irq, vector;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	dest_phys_id = cpu_physical_id(first_cpu(cpu_online_map));
 	vector = irq;
 
@@ -89,12 +94,12 @@ int ia64_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &ia64_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 void ia64_teardown_msi_irq(unsigned int irq)
 {
-	return;		/* no-op */
+	destroy_irq(irq);
 }
 
 static void ia64_ack_msi_irq(unsigned int irq)
@@ -126,12 +131,12 @@ static struct irq_chip ia64_msi_chip = {
 };
 
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int arch_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *desc)
 {
 	if (platform_setup_msi_irq)
-		return platform_setup_msi_irq(irq, pdev);
+		return platform_setup_msi_irq(pdev, desc);
 
-	return ia64_setup_msi_irq(irq, pdev);
+	return ia64_setup_msi_irq(pdev, desc);
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
diff --git a/arch/ia64/sn/kernel/msi_sn.c b/arch/ia64/sn/kernel/msi_sn.c
index 31fbb85..ea3dc38 100644
--- a/arch/ia64/sn/kernel/msi_sn.c
+++ b/arch/ia64/sn/kernel/msi_sn.c
@@ -59,13 +59,12 @@ void sn_teardown_msi_irq(unsigned int irq)
 	sn_intr_free(nasid, widget, sn_irq_info);
 	sn_msi_info[irq].sn_irq_info = NULL;
 
-	return;
+	destroy_irq(irq);
 }
 
-int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
+int sn_setup_msi_irq(struct pci_dev *pdev, struct msi_desc *entry)
 {
 	struct msi_msg msg;
-	struct msi_desc *entry;
 	int widget;
 	int status;
 	nasid_t nasid;
@@ -73,8 +72,8 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	struct sn_irq_info *sn_irq_info;
 	struct pcibus_bussoft *bussoft = SN_PCIDEV_BUSSOFT(pdev);
 	struct sn_pcibus_provider *provider = SN_PCIDEV_BUSPROVIDER(pdev);
+	int irq;
 
-	entry = get_irq_msi(irq);
 	if (!entry->msi_attrib.is_64)
 		return -EINVAL;
 
@@ -84,6 +83,11 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (provider == NULL || provider->dma_map_consistent == NULL)
 		return -EINVAL;
 
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, entry);
 	/*
 	 * Set up the vector plumbing.  Let the prom (via sn_intr_alloc)
 	 * decide which cpu to direct this msi at by default.
@@ -95,12 +99,15 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 			SWIN_WIDGETNUM(bussoft->bs_base);
 
 	sn_irq_info = kzalloc(sizeof(struct sn_irq_info), GFP_KERNEL);
-	if (! sn_irq_info)
+	if (! sn_irq_info) {
+		destroy_irq(irq);
 		return -ENOMEM;
+	}
 
 	status = sn_intr_alloc(nasid, widget, sn_irq_info, irq, -1, -1);
 	if (status) {
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -121,6 +128,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	if (! bus_addr) {
 		sn_intr_free(nasid, widget, sn_irq_info);
 		kfree(sn_irq_info);
+		destroy_irq(irq);
 		return -ENOMEM;
 	}
 
@@ -139,7 +147,7 @@ int sn_setup_msi_irq(unsigned int irq, struct pci_dev *pdev)
 	write_msi_msg(irq, &msg);
 	set_irq_chip_and_handler(irq, &sn_msi_chip, handle_edge_irq);
 
-	return 0;
+	return irq;
 }
 
 #ifdef CONFIG_SMP
diff --git a/arch/x86_64/kernel/io_apic.c b/arch/x86_64/kernel/io_apic.c
index d7bad90..6be6730 100644
--- a/arch/x86_64/kernel/io_apic.c
+++ b/arch/x86_64/kernel/io_apic.c
@@ -1956,24 +1956,31 @@ static struct irq_chip msi_chip = {
 	.retrigger	= ioapic_retrigger_irq,
 };
 
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev)
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc)
 {
 	struct msi_msg msg;
-	int ret;
+	int irq, ret;
+	irq = create_irq();
+	if (irq < 0)
+		return irq;
+
+	set_irq_msi(irq, desc);
 	ret = msi_compose_msg(dev, irq, &msg);
-	if (ret < 0)
+	if (ret < 0) {
+		destroy_irq(irq);
 		return ret;
+	}
 
 	write_msi_msg(irq, &msg);
 
 	set_irq_chip_and_handler_name(irq, &msi_chip, handle_edge_irq, "edge");
 
-	return 0;
+	return irq;
 }
 
 void arch_teardown_msi_irq(unsigned int irq)
 {
-	return;
+	destroy_irq(irq);
 }
 
 #endif /* CONFIG_PCI_MSI */
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index d7a2259..c6a6d46 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -192,37 +192,6 @@ static struct msi_desc* alloc_msi_entry(void)
 	return entry;
 }
 
-static int create_msi_irq(void)
-{
-	struct msi_desc *entry;
-	int irq;
-
-	entry = alloc_msi_entry();
-	if (!entry)
-		return -ENOMEM;
-
-	irq = create_irq();
-	if (irq < 0) {
-		kmem_cache_free(msi_cachep, entry);
-		return -EBUSY;
-	}
-
-	set_irq_msi(irq, entry);
-
-	return irq;
-}
-
-static void destroy_msi_irq(unsigned int irq)
-{
-	struct msi_desc *entry;
-
-	entry = get_irq_msi(irq);
-	set_irq_chip(irq, NULL);
-	set_irq_msi(irq, NULL);
-	destroy_irq(irq);
-	kmem_cache_free(msi_cachep, entry);
-}
-
 static void enable_msi_mode(struct pci_dev *dev, int pos, int type)
 {
 	u16 control;
@@ -449,7 +418,6 @@ void pci_restore_msi_state(struct pci_dev *dev)
  **/
 static int msi_capability_init(struct pci_dev *dev)
 {
-	int status;
 	struct msi_desc *entry;
 	int pos, irq;
 	u16 control;
@@ -457,13 +425,10 @@ static int msi_capability_init(struct pci_dev *dev)
    	pos = pci_find_capability(dev, PCI_CAP_ID_MSI);
 	pci_read_config_word(dev, msi_control_reg(pos), &control);
 	/* MSI Entry Initialization */
-	irq = create_msi_irq();
-	if (irq < 0)
-		return irq;
+	entry = alloc_msi_entry();
+	if (!entry)
+		return -ENOMEM;
 
-	entry = get_irq_msi(irq);
-	entry->link.head = irq;
-	entry->link.tail = irq;
 	entry->msi_attrib.type = PCI_CAP_ID_MSI;
 	entry->msi_attrib.is_64 = is_64bit_address(control);
 	entry->msi_attrib.entry_nr = 0;
@@ -489,14 +454,16 @@ static int msi_capability_init(struct pci_dev *dev)
 			maskbits);
 	}
 	/* Configure MSI capability structure */
-	status = arch_setup_msi_irq(irq, dev);
-	if (status < 0) {
-		destroy_msi_irq(irq);
-		return status;
+	irq = arch_setup_msi_irq(dev, entry);
+	if (irq < 0) {
+		kmem_cache_free(msi_cachep, entry);
+		return irq;
 	}
-
+	entry->link.head = irq;
+	entry->link.tail = irq;
 	dev->first_msi_irq = irq;
 	set_irq_msi(irq, entry);
+
 	/* Set MSI enabled bits	 */
 	enable_msi_mode(dev, pos, PCI_CAP_ID_MSI);
 
@@ -518,7 +485,6 @@ static int msix_capability_init(struct pci_dev *dev,
 				struct msix_entry *entries, int nvec)
 {
 	struct msi_desc *head = NULL, *tail = NULL, *entry = NULL;
-	int status;
 	int irq, pos, i, j, nr_entries, temp = 0;
 	unsigned long phys_addr;
 	u32 table_offset;
@@ -541,13 +507,11 @@ static int msix_capability_init(struct pci_dev *dev,
 
 	/* MSI-X Table Initialization */
 	for (i = 0; i < nvec; i++) {
-		irq = create_msi_irq();
-		if (irq < 0)
+		entry = alloc_msi_entry();
+		if (!entry)
 			break;
 
-		entry = get_irq_msi(irq);
  		j = entries[i].entry;
- 		entries[i].vector = irq;
 		entry->msi_attrib.type = PCI_CAP_ID_MSIX;
 		entry->msi_attrib.is_64 = 1;
 		entry->msi_attrib.entry_nr = j;
@@ -556,6 +520,14 @@ static int msix_capability_init(struct pci_dev *dev,
 		entry->msi_attrib.pos = pos;
 		entry->dev = dev;
 		entry->mask_base = base;
+
+		/* Configure MSI-X capability structure */
+		irq = arch_setup_msi_irq(dev, entry);
+		if (irq < 0) {
+			kmem_cache_free(msi_cachep, entry);
+			break;
+		}
+ 		entries[i].vector = irq;
 		if (!head) {
 			entry->link.head = irq;
 			entry->link.tail = irq;
@@ -568,12 +540,6 @@ static int msix_capability_init(struct pci_dev *dev,
 		}
 		temp = irq;
 		tail = entry;
-		/* Configure MSI-X capability structure */
-		status = arch_setup_msi_irq(irq, dev);
-		if (status < 0) {
-			destroy_msi_irq(irq);
-			break;
-		}
 
 		set_irq_msi(irq, entry);
 	}
@@ -717,8 +683,6 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	int head, entry_nr, type;
 	void __iomem *base;
 
-	arch_teardown_msi_irq(irq);
-
 	entry = get_irq_msi(irq);
 	if (!entry || entry->dev != dev) {
 		return -EINVAL;
@@ -729,9 +693,9 @@ static int msi_free_irq(struct pci_dev* dev, int irq)
 	base = entry->mask_base;
 	get_irq_msi(entry->link.head)->link.tail = entry->link.tail;
 	get_irq_msi(entry->link.tail)->link.head = entry->link.head;
-	entry->dev = NULL;
 
-	destroy_msi_irq(irq);
+	arch_teardown_msi_irq(irq);
+	kmem_cache_free(msi_cachep, entry);
 
 	if (type == PCI_CAP_ID_MSIX) {
 		writel(1, base + entry_nr * PCI_MSIX_ENTRY_SIZE +
diff --git a/include/asm-ia64/machvec.h b/include/asm-ia64/machvec.h
index a3891eb..3c96ac1 100644
--- a/include/asm-ia64/machvec.h
+++ b/include/asm-ia64/machvec.h
@@ -21,6 +21,7 @@ struct mm_struct;
 struct pci_bus;
 struct task_struct;
 struct pci_dev;
+struct msi_desc;
 
 typedef void ia64_mv_setup_t (char **);
 typedef void ia64_mv_cpu_init_t (void);
@@ -79,7 +80,7 @@ typedef unsigned short ia64_mv_readw_relaxed_t (const volatile void __iomem *);
 typedef unsigned int ia64_mv_readl_relaxed_t (const volatile void __iomem *);
 typedef unsigned long ia64_mv_readq_relaxed_t (const volatile void __iomem *);
 
-typedef int ia64_mv_setup_msi_irq_t (unsigned int irq, struct pci_dev *pdev);
+typedef int ia64_mv_setup_msi_irq_t (struct pci_dev *pdev, struct msi_desc *);
 typedef void ia64_mv_teardown_msi_irq_t (unsigned int irq);
 
 static inline void
diff --git a/include/linux/msi.h b/include/linux/msi.h
index b99976b..74c8a2e 100644
--- a/include/linux/msi.h
+++ b/include/linux/msi.h
@@ -41,7 +41,7 @@ struct msi_desc {
 /*
  * The arch hook for setup up msi irqs
  */
-int arch_setup_msi_irq(unsigned int irq, struct pci_dev *dev);
+int arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *desc);
 void arch_teardown_msi_irq(unsigned int irq);
 
 
-- 
1.4.4.1.g278f


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 19:40 ` [PATCH 0/6] MSI portability cleanups Eric W. Biederman
  2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
@ 2007-01-28 20:23   ` Benjamin Herrenschmidt
  2007-01-28 20:47     ` Jeff Garzik
  2007-01-28 21:34     ` Eric W. Biederman
  1 sibling, 2 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 20:23 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller


> The other big change is that I added a field to irq_desc to point
> at the msi_desc.  This removes the conflicts with the existing pointer
> fields and makes the irq -> msi_desc mapping useable outside of msi.c

I'm not even sure we would have needed that with Michael's mecanism in
fact. One other reason why I prefer it.

Basically, backends like MPIC etc... don't need it. The irq chip
operations are normal MPIC operations and don't need to know they are
done on an MSI nor what MSI etc... and thus we don't need it. Same with
RTAS.

On the other hand, x86 needs it, but then, x86 uses it's own MSI
specific irq_chip, in which case it can use irq_desc->chip_data as long
as it does it within the backend.

So I may have missed a case where a given backend might need both that
irq -> msi_desc mapping -and- use irq_desc->chip_data for other things,
but that's one thing I was hoping we could avoid with Michael's code.

> The only architecture problem that isn't solvable in this context is
> the problem of supporting the crazy hypervisor on the ppc RTAS, which
> asks us to drive the hardware but does not give us access to the
> hardware registers.

So you are saying that we should use your model while admitting that it
can't solve our problems...

I really don't understand why you seem so totally opposed to Michael's
approach which definitely looks to me like the sane thing to do. Note
that in the end, Michael's approach isn't -that- different from yours,
just a bit more abstracted.

Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
@ 2007-01-28 20:47     ` Jeff Garzik
  2007-01-28 21:20       ` Eric W. Biederman
                         ` (2 more replies)
  2007-01-28 21:34     ` Eric W. Biederman
  1 sibling, 3 replies; 35+ messages in thread
From: Jeff Garzik @ 2007-01-28 20:47 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Eric W. Biederman, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt wrote:
>> The only architecture problem that isn't solvable in this context is
>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>> asks us to drive the hardware but does not give us access to the
>> hardware registers.
> 
> So you are saying that we should use your model while admitting that it
> can't solve our problems...
> 
> I really don't understand why you seem so totally opposed to Michael's
> approach which definitely looks to me like the sane thing to do. Note
> that in the end, Michael's approach isn't -that- different from yours,
> just a bit more abstracted.


I think the high-level ops approach makes more sense.  It's more future 
proof, in addition to covering all existing implementations.

	Jeff



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47     ` Jeff Garzik
@ 2007-01-28 21:20       ` Eric W. Biederman
  2007-01-28 21:26         ` Ingo Molnar
                           ` (2 more replies)
  2007-01-28 22:11       ` Eric W. Biederman
  2007-01-28 23:42       ` David Miller
  2 siblings, 3 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 21:20 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Benjamin Herrenschmidt, Greg Kroah-Hartman, Tony Luck,
	Grant Grundler, Ingo Molnar, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller

Jeff Garzik <jeff@garzik.org> writes:

> Benjamin Herrenschmidt wrote:
>>> The only architecture problem that isn't solvable in this context is
>>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>>> asks us to drive the hardware but does not give us access to the
>>> hardware registers.
>>
>> So you are saying that we should use your model while admitting that it
>> can't solve our problems...
>>
>> I really don't understand why you seem so totally opposed to Michael's
>> approach which definitely looks to me like the sane thing to do. Note
>> that in the end, Michael's approach isn't -that- different from yours,
>> just a bit more abstracted.
>
>
> I think the high-level ops approach makes more sense.  It's more future proof,
> in addition to covering all existing implementations.

I'm not arguing against an operations based approach.  I'm arguing for simple
obviously correct steps, and not throwing the baby out with the bath
water.

My patches should be a precursor to an operations based approach
because they are simple step from where we are now.

Every keeps telling me the operations approach is the right thing to
do and I see code that doesn't work, and can't work without extreme
difficulty on the architectures currently supported.  That makes me
irritated, and unfortunately much less accepting.

I see people pushing ridiculous interfaces like the RTAS hypervisor
interface at me, and saying we must support running firmware drivers
in the msi code.

I just ask for simple evolutionary change as I presented, so we don't
break things or loose requirements along the way.

Please argue with me on the details of what the ops based approach does
better, which specific problems does it solve. 

The proposed ops base approach mixes different kinds of operations
in the same structure:

We have the hardware operations:
+	/* enable - Enable the MSIs on the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine enables the MSIs on the given PCI device.
+	 *
+	 * If the enable completes succesfully this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*enable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* disable - disable the MSI for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs to disable.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+         * This routine should perform the inverse of enable.
+	 */
+	void (*disable) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+

Which are either talking directly to the hardware, or are talking
to the hypervisor, which is using hardware isolation so it is safe to
talk directly to the hardware but isn't leting us?  If we could use
things to work around errata in card implementation details it would
make some sense to me (although we don't seem to have any cards with
that got the MSI registers wrong at this point).  Regardless these
operations clearly have a different granularity than the other
operations, and should have a different lookup method.


We have the irq operations.
+	/* check - Check that the requested MSI allocation is OK.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for checking that the given PCI device
+	 * can be allocated the requested type and number of MSIs.
+	 *
+	 * It is up to this routine to determine if the requested number of
+	 * MSIs is valid for the device in question. If the number of MSIs,
+	 * or the particular MSI entries, can not be supported for any
+	 * reason this routine must return non-zero.
+	 *
+	 * If the check is succesful this routine must return 0.
+	 */
+	int (*check) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* alloc - Allocate MSIs for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs being requested.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * This routine is responsible for allocating the number of
+	 * MSIs to the given PCI device.
+	 *
+	 * Upon completion there must be @num MSIs assigned to this device,
+	 * the "vector" member of each struct msix_entry must be filled in
+	 * with the Linux irq number allocated to it. The corresponding
+	 * irq_descs must also be setup with an appropriate handler if
+	 * required.
+	 *
+	 * If the allocation completes succesfully this routine must return 0.
+	 */
+	int (*alloc) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);
+
+	/* free - free the MSIs assigned to the device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @num:	The number of MSIs.
+	 * @entries:	An array of @num msix_entry structures.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Free all MSIs and associated resources for the device. If any
+	 * MSIs have been enabled they will have been disabled already by
+	 * the generic code.
+	 */
+	void (*free) (struct pci_dev *pdev, int num,
+				struct msix_entry *entries, int type);

These because they are per irq make sense as per bus operations unless
you have a good architecture definition like x86 has.  Roughly those
operations are what we currently have except the current operations
are a little simpler and easier to deal with for the architecture
code.

And then there are the operations that are going in the wrong
direction.
+	/* setup_msi_msg - Setup an MSI message for the given device.
+	 *
+	 * @pdev:	PCI device structure.
+	 * @entry:	The MSI entry to create a msi_msg for.
+	 * @msg:	Written with the magic address and data.
+	 * @type:	The type, MSI or MSI-X.
+	 *
+	 * Returns the "magic address and data" used to trigger the msi.
+	 * If the setup is succesful this routine must return 0.
+	 *
+	 * This callback is optional.
+	 */
+	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
+				struct msi_msg *msg, int type);

Much to much of the operations base approach as proposed looks like
when you have a hammer every problem looks like a nail, given how much
confusion about what was put into the operations structure.

I don't mind taking a small step and making the alloc/free primitives
per bus in a generic fashion.

I don't mind supporting poorly designed hypervisor interfaces, if it
is easy.

I do strongly mind code that doesn't work, or we can't git-bisect
through to find where bugs were introduced.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20       ` Eric W. Biederman
@ 2007-01-28 21:26         ` Ingo Molnar
  2007-01-28 22:09         ` Benjamin Herrenschmidt
  2007-01-28 23:44         ` David Miller
  2 siblings, 0 replies; 35+ messages in thread
From: Ingo Molnar @ 2007-01-28 21:26 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Benjamin Herrenschmidt, Greg Kroah-Hartman,
	Tony Luck, Grant Grundler, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller


* Eric W. Biederman <ebiederm@xmission.com> wrote:

> I'm not arguing against an operations based approach.  I'm arguing for 
> simple obviously correct steps, and not throwing the baby out with the 
> bath water.
> 
> My patches should be a precursor to an operations based approach
> because they are simple step from where we are now.

yeah. I'd say your approach is to go from A to B:

  [A] -----------------------------------------------------> [B]
                                                              |
                                                             [C]

while there might be some other arguments that "no, lets go to C 
instead", i say lets not throw away the already implemented and already 
working and nicely layered [A]->[B] transition, just because there's an 
argument whether the end result should be 'B' or 'C'. Unless someone who 
wants to see 'C' produces a patchset that walks the whole way i dont see 
any reason to not go with your patchset. It clearly removes alot of 
cruft.

Acked-by: Ingo Molnar <mingo@elte.hu>

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
  2007-01-28 20:47     ` Jeff Garzik
@ 2007-01-28 21:34     ` Eric W. Biederman
  1 sibling, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 21:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>> The other big change is that I added a field to irq_desc to point
>> at the msi_desc.  This removes the conflicts with the existing pointer
>> fields and makes the irq -> msi_desc mapping useable outside of msi.c
>
> I'm not even sure we would have needed that with Michael's mecanism in
> fact. One other reason why I prefer it.
>
> Basically, backends like MPIC etc... don't need it. The irq chip
> operations are normal MPIC operations and don't need to know they are
> done on an MSI nor what MSI etc... and thus we don't need it. Same with
> RTAS.

If you get rid of the bass ackwards setup_msi_msg operation they do,
so you can support at least one write_msi_msg call.

> On the other hand, x86 needs it, but then, x86 uses it's own MSI
> specific irq_chip, in which case it can use irq_desc->chip_data as long
> as it does it within the backend.

Most of the uses are within msi.c as the code is currently structured
which means you can't use it that way.

> So I may have missed a case where a given backend might need both that
> irq -> msi_desc mapping -and- use irq_desc->chip_data for other things,
> but that's one thing I was hoping we could avoid with Michael's code.

That is where we are today.  Find a way to remove the code that uses it
and it can go away.

>> The only architecture problem that isn't solvable in this context is
>> the problem of supporting the crazy hypervisor on the ppc RTAS, which
>> asks us to drive the hardware but does not give us access to the
>> hardware registers.
>
> So you are saying that we should use your model while admitting that it
> can't solve our problems...

My approach can solve your problems with a few tweaks just like Michaels
approach would have needed to solve mine.

> I really don't understand why you seem so totally opposed to Michael's
> approach which definitely looks to me like the sane thing to do. Note
> that in the end, Michael's approach isn't -that- different from yours,
> just a bit more abstracted.

1) Because every one tells me it is the greatest thing since sliced bread,
   and when I look it simply doesn't work, and my feeling would be it would
   be a complete retesting effort of all currently supported architectures
   to make Michaels code work.

2) Because it was scrap and replace, which is a horrible way to deal with
   a problem when we have 3 architectures working already.

Honestly I think Michael and I can get along but all of the cheer leaders seem
to be exacerbating the situation.

I do agree Michael's approach isn't that different than mine and I think we
can converge on a single implementation.  To a large extent that is what
my patchset is about.  Moving the current code far enough it is usable,
and a reasonable basis for more work.

I don't write the current code but since I touched it and started cleaning
it up I seem to be stuck with it.  So I will be happy to take care of it
until we get a version that all architectures can use.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
  2007-01-28 19:44     ` [PATCH 2/6] msi: Remove msi_lock Eric W. Biederman
@ 2007-01-28 22:01     ` Paul Mackerras
  2007-01-28 22:18       ` Eric W. Biederman
  1 sibling, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2007-01-28 22:01 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Eric W. Biederman writes:

> @@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
>  	if (!pos)
>  		return -EINVAL;
>  
> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
> +	WARN_ON(!!dev->msi_enabled);

Minor nit: what's wrong with just WARN_ON(dev->msi_enabled) ?
Also here:

> @@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct msix_entry *entries, int nvec)
>  				return -EINVAL;	/* duplicate entry */
>  		}
>  	}
> -	temp = dev->irq;
> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
> +	WARN_ON(!!dev->msix_enabled);

Paul.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20       ` Eric W. Biederman
  2007-01-28 21:26         ` Ingo Molnar
@ 2007-01-28 22:09         ` Benjamin Herrenschmidt
  2007-01-28 23:26           ` Eric W. Biederman
  2007-02-01  4:29           ` Greg KH
  2007-01-28 23:44         ` David Miller
  2 siblings, 2 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-28 22:09 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller


 .../... (enable/disable bits)

> Which are either talking directly to the hardware, or are talking
> to the hypervisor, which is using hardware isolation so it is safe to
> talk directly to the hardware but isn't leting us?  If we could use
> things to work around errata in card implementation details it would
> make some sense to me (although we don't seem to have any cards with
> that got the MSI registers wrong at this point).  Regardless these
> operations clearly have a different granularity than the other
> operations, and should have a different lookup method.

I'm not sure I undersdand the point of your rant here. The hypervisor
case hooks at alloc/free and does everything from there. It doens't use
an enable or a diable hook.

The enable/disable ops are optional for that reason. When not present,
it's assumed that alloc/free do it all.

When using a "direct" approach (what we call "raw"), we expect backends
to just plug the provided helper functions in enable/disable. It's still
a hook so that one can do additional platform specific bits if
necessary, but in that specific case, I do agree we could just remove it
and move the "raw" code back into the toplevel functions, with a way
(via a special return code from alloc maybe ?) for the HV case to tell
us not to go through there. That was one of our initial approaches when
working with Michael on the design.

However, that sort of hurts my sense of aestetics :-) I quite like the
toplevel to be just a toplevel, and clearly separate the raw "helpers"
and the backend. Provides more flexibility to handle all possible crazy
cases in the future.

You seem to absolutely want to get the HV case to go throuh the same
code path as the "raw" case, and that will not happen.

  .../... (irq operations)

> These because they are per irq make sense as per bus operations unless
> you have a good architecture definition like x86 has.  Roughly those
> operations are what we currently have except the current operations
> are a little simpler and easier to deal with for the architecture
> code.

Oh ? How so ? (easier/simpler ?)

> And then there are the operations that are going in the wrong
> direction.
> +	/* setup_msi_msg - Setup an MSI message for the given device.
> +	 *
> +	 * @pdev:	PCI device structure.
> +	 * @entry:	The MSI entry to create a msi_msg for.
> +	 * @msg:	Written with the magic address and data.
> +	 * @type:	The type, MSI or MSI-X.
> +	 *
> +	 * Returns the "magic address and data" used to trigger the msi.
> +	 * If the setup is succesful this routine must return 0.
> +	 *
> +	 * This callback is optional.
> +	 */
> +	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
> +				struct msi_msg *msg, int type);
> 
> Much to much of the operations base approach as proposed looks like
> when you have a hammer every problem looks like a nail, given how much
> confusion about what was put into the operations structure.

This is indeed a lower level hook to be used by "raw" enable/disable. An
other approach would be to remove it, have each backend have it's own
enable/disable that obtains the address/data and calls into a helper to
program them. This would indeed be a little bit nicer in a layering
perspective. But it adds a bit more code to each backend, so we kept
things closer to the way they used to be. I don't have a firm reason not
to change it however, I need talk to Michael in case he has more good
reasons to keep it that way around. 

> I don't mind taking a small step and making the alloc/free primitives
> per bus in a generic fashion. 
>
> I don't mind supporting poorly designed hypervisor interfaces, if it
> is easy.

And it it's not, we don't support them ? Ugh ? Well, it happens to be
fairly easy but still, I don't understand your approach there.

> I do strongly mind code that doesn't work, or we can't git-bisect
> through to find where bugs were introduced.

It doesn't work yet for you which is why it's not -replacing- your
current code. Again, this was intended as arch code in the first place,
until other archs and maintainers voiced their opinion that we should
move that to generic code. It may not be perfect, we may still want to
change things, maybe make some things closer to the direction you are
taking for the x86 code, but I don't understand the root of such a
strong opposition except mayeb that you've spent time trying to fix the
x86 junk and now are annoyed to see some of that work possibly
replaced ?

I agree with the problem if small changes & bisecting in the general
case. In fact, it would be nice if we could use your fixed code with
little change to "plug" it in as the x86 backend in many ways. Michael's
work isn't a re-implementation of everything, it's a re-structuring,
lots of bits of code that are missing can possibly be lifted from the
existing working implementation.

If we followed that "only do incrementental changes" rule all the time,
imagine in what state would be our USB stack today since we couldn't
have dropped in Linus replacement one ...

Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47     ` Jeff Garzik
  2007-01-28 21:20       ` Eric W. Biederman
@ 2007-01-28 22:11       ` Eric W. Biederman
  2007-01-28 23:42       ` David Miller
  2 siblings, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 22:11 UTC (permalink / raw)
  To: Jeff Garzik
  Cc: Benjamin Herrenschmidt, Greg Kroah-Hartman, Tony Luck,
	Grant Grundler, Ingo Molnar, linux-kernel, Kyle McMartin,
	linuxppc-dev, Brice Goglin, shaohua.li, linux-pci,
	David S. Miller

Jeff Garzik <jeff@garzik.org> writes:

> I think the high-level ops approach makes more sense.  It's more future proof,
> in addition to covering all existing implementations.

To be precise in Michaels implementation one of the parameters passed is
a type parameter so that the architecture has to know about each different
type of msi implementation.   In my implementation that field does not exist,
because it is unnecessary.  So as long as the message on the bus is a msi
message my implementation can be adapted to support it without any architecture
changes.

Being future proof is about getting the abstraction correct, and exposing
those details that matter, and removing those detail that don't.

It is a minor nit, not a fundamental flaw in the operations concept.  But
one of the reasons I am opposed to throwing out the current working code.
Evolutionary change ensures that things only the code remembers don't get
left behind.

I guess that is the other part of the discussion that shows up here
is, as long as the change is an evolutionary change from what is
working today.  I don't have any fundamental problems with it, but I
am completely against a revolutionary change.

Meanwhile because Michael has proposed operations my position has been
perceived as against operations.  While I have a lot of technical nits
to pick with the Michaels operations approach, I'm not fundamentally
against it.  I just don't want to loose the information that only
the code remembers.

Most of my technical objections have been formed by looking at what
the code does today, looking at what Michaels code is doing and seeing
details he missed.  If we just start with the current code base and
fix it the whole approach is much easier.

Anyway last I heard Michael was working on starting with the current
msi.c and making his patch set work, and I am hoping that my work
will make that patchset cleaner, and easier to do.  Even if we do
conflict at the moment :)

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 1/6] msi: Kill msi_lookup_irq
  2007-01-28 22:01     ` [PATCH 1/6] msi: Kill msi_lookup_irq Paul Mackerras
@ 2007-01-28 22:18       ` Eric W. Biederman
  0 siblings, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 22:18 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Greg Kroah-Hartman, Tony Luck, Grant Grundler, Ingo Molnar,
	linux-kernel, Kyle McMartin, linuxppc-dev, Brice Goglin,
	shaohua.li, linux-pci, David S. Miller

Paul Mackerras <paulus@samba.org> writes:

> Eric W. Biederman writes:
>
>> @@ -693,15 +664,14 @@ int pci_enable_msi(struct pci_dev* dev)
>>  	if (!pos)
>>  		return -EINVAL;
>>  
>> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSI));
>> +	WARN_ON(!!dev->msi_enabled);
>
> Minor nit: what's wrong with just WARN_ON(dev->msi_enabled) ?

It's a bitfield so gcc complains when something in WARN_ON calls
typeof on it.  So it is easier to just say !! than to dig into
WARN_ON and see if it made any sense to fix WARN_ON, or to see if gcc
needed the bug fix.

> Also here:
>
>> @@ -836,16 +811,14 @@ int pci_enable_msix(struct pci_dev* dev, struct
> msix_entry *entries, int nvec)
>>  				return -EINVAL;	/* duplicate entry */
>>  		}
>>  	}
>> -	temp = dev->irq;
>> -	WARN_ON(!msi_lookup_irq(dev, PCI_CAP_ID_MSIX));
>> +	WARN_ON(!!dev->msix_enabled);
>
> Paul.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 22:09         ` Benjamin Herrenschmidt
@ 2007-01-28 23:26           ` Eric W. Biederman
  2007-01-28 23:37             ` David Miller
  2007-01-29  1:33             ` Benjamin Herrenschmidt
  2007-02-01  4:29           ` Greg KH
  1 sibling, 2 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-28 23:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

>  .../... (enable/disable bits)
>
>> Which are either talking directly to the hardware, or are talking
>> to the hypervisor, which is using hardware isolation so it is safe to
>> talk directly to the hardware but isn't leting us?  If we could use
>> things to work around errata in card implementation details it would
>> make some sense to me (although we don't seem to have any cards with
>> that got the MSI registers wrong at this point).  Regardless these
>> operations clearly have a different granularity than the other
>> operations, and should have a different lookup method.
>
> I'm not sure I undersdand the point of your rant here. The hypervisor
> case hooks at alloc/free and does everything from there. It doens't use
> an enable or a diable hook.
>
> The enable/disable ops are optional for that reason. When not present,
> it's assumed that alloc/free do it all.

Well my feeling is that in your weird HV case enable/disable should do
all of the work.  And alloc/free won't have to do anything because the
bus doesn't matter any more.

> When using a "direct" approach (what we call "raw"), we expect backends
> to just plug the provided helper functions in enable/disable. It's still
> a hook so that one can do additional platform specific bits if
> necessary, but in that specific case, I do agree we could just remove it
> and move the "raw" code back into the toplevel functions, with a way
> (via a special return code from alloc maybe ?) for the HV case to tell
> us not to go through there. That was one of our initial approaches when
> working with Michael on the design.
>
> However, that sort of hurts my sense of aestetics :-) I quite like the
> toplevel to be just a toplevel, and clearly separate the raw "helpers"
> and the backend. Provides more flexibility to handle all possible crazy
> cases in the future.

To be clear I see this as 2 distinct layers of code. enable/disable
that talks directly to the hardware, and the helpers of enable/disable
that allocate the irq.  I base this on the fact that I only need the
alloc/free when I am exclusively working with real hardware.

> You seem to absolutely want to get the HV case to go throuh the same
> code path as the "raw" case, and that will not happen.

Yes I do.  Because that is the only sane approach for a HV to use.
And yes we need an irq allocator to call the HV to setup the upstream
reception of the msi message.

However I don't think it will be to hard to support your HV once we get
the real hardware supported.  I just refuse to consider it before we have
figured out what makes sense in the context where we have to do everything.


>   .../... (irq operations)
>
>> These because they are per irq make sense as per bus operations unless
>> you have a good architecture definition like x86 has.  Roughly those
>> operations are what we currently have except the current operations
>> are a little simpler and easier to deal with for the architecture
>> code.
>
> Oh ? How so ? (easier/simpler ?)

I don't take a type parameter, and I don't take a vector.  All of
that work is done in the generic code.

>> And then there are the operations that are going in the wrong
>> direction.
>> +	/* setup_msi_msg - Setup an MSI message for the given device.
>> +	 *
>> +	 * @pdev:	PCI device structure.
>> +	 * @entry:	The MSI entry to create a msi_msg for.
>> +	 * @msg:	Written with the magic address and data.
>> +	 * @type:	The type, MSI or MSI-X.
>> +	 *
>> +	 * Returns the "magic address and data" used to trigger the msi.
>> +	 * If the setup is succesful this routine must return 0.
>> +	 *
>> +	 * This callback is optional.
>> +	 */
>> +	int (*setup_msi_msg) (struct pci_dev *pdev, struct msix_entry *entry,
>> +				struct msi_msg *msg, int type);
>> 
>> Much to much of the operations base approach as proposed looks like
>> when you have a hammer every problem looks like a nail, given how much
>> confusion about what was put into the operations structure.
>
> This is indeed a lower level hook to be used by "raw" enable/disable. An
> other approach would be to remove it, have each backend have it's own
> enable/disable that obtains the address/data and calls into a helper to
> program them. This would indeed be a little bit nicer in a layering
> perspective. But it adds a bit more code to each backend, so we kept
> things closer to the way they used to be. I don't have a firm reason not
> to change it however, I need talk to Michael in case he has more good
> reasons to keep it that way around. 

The current code in the kernel already is structured that way because
we have to reprogram the msi message on each irq migration.  Not using
a helper to write the message would be a noticeable change and require
a fair amount of code rewriting on the currently supported
architectures.

>> I don't mind taking a small step and making the alloc/free primitives
>> per bus in a generic fashion. 
>>
>> I don't mind supporting poorly designed hypervisor interfaces, if it
>> is easy.
>
> And it it's not, we don't support them ? Ugh ? Well, it happens to be
> fairly easy but still, I don't understand your approach there.

Yes.  In general the mainline linux kernel does not support certain
classes of stupidity.  TCP offload engines, firmware drivers for
hardware we care about, a fixed ABI to binary only modules, etc.
It is the responsibility of the OS to setup MSI so we do it, not
the firmware so we do it.

Not supporting stupid things that are hard to support encourages other
people not to be so silly, especially when linux still works on the
hardware when that silly feature isn't supported.

For similar reasons we don't support more than 1 irq with a plain MSI
capability.  It is hard, we can't do it on most hardware, and anyone
who wants more than 1 irq should just implement MSI-X and everyone
will be able to use it, on any hardware.

Part of the reason to not support a messed up HV interface if it hard
is that a HV is just software.  Which means the incremental cost to
fix it is roughly the same as fixing the linux kernel, and it puts
the burden on the people doing stupid things not on the rest of us
forever more.

>> I do strongly mind code that doesn't work, or we can't git-bisect
>> through to find where bugs were introduced.
>
> It doesn't work yet for you which is why it's not -replacing- your
> current code. Again, this was intended as arch code in the first place,
> until other archs and maintainers voiced their opinion that we should
> move that to generic code. It may not be perfect, we may still want to
> change things, maybe make some things closer to the direction you are
> taking for the x86 code, but I don't understand the root of such a
> strong opposition except mayeb that you've spent time trying to fix the
> x86 junk and now are annoyed to see some of that work possibly
> replaced ?

No.  I have spent time fixing what is there, and made it work.  I see
implementations proposed that don't handle cases I have fixed, and I
don't see anything that resembles a simple migration path for i386,
x86_64 and ia64.  Which is part of what annoys me when I am told
the ops work for everything.

As for the code not working important parts of the code (like MSI-X)
don't even work on ppc.  The strength of my opposition is largely
shaped by the number of people wearing rose colored glasses and
ignoring the problems, and missing huge details.

Given that we have been talking about things since before OLS I would
have expected the ppc code to be a little farther along.

> I agree with the problem if small changes & bisecting in the general
> case. In fact, it would be nice if we could use your fixed code with
> little change to "plug" it in as the x86 backend in many ways. Michael's
> work isn't a re-implementation of everything, it's a re-structuring,
> lots of bits of code that are missing can possibly be lifted from the
> existing working implementation.

Not the x86 backend but the raw backend.  You might not need all of
the features because you are always going through another interrupt
controller but that doesn't mean they shouldn't be there.

Michael has at least agreed to look in that direction so I'm hoping
my changes remove some of the difficulty for him.

> If we followed that "only do incrementental changes" rule all the time,
> imagine in what state would be our USB stack today since we couldn't
> have dropped in Linus replacement one ...

Well even that was a partial following of that rule because you didn't
rewrite the rest of the kernel at the same time, to better support
usb.  I do agree that there are instances where a complete rewrite is
the best path.  In this case I don't see a reasonable case for not
reusing what is there.

Nor do I see the level of care being put into the problem that would
cause me to trust a rewrite.  I have a huge number of little technical
problems with the proposed code, and see absolutely no overriding
virtue in it.  Especially when the worst of the problems with msi.c
can be easily fixed, as demonstrated by my patchset.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:26           ` Eric W. Biederman
@ 2007-01-28 23:37             ` David Miller
  2007-01-29  5:18               ` Eric W. Biederman
  2007-01-29  1:33             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 35+ messages in thread
From: David Miller @ 2007-01-28 23:37 UTC (permalink / raw)
  To: ebiederm
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 16:26:44 -0700

> Yes.  In general the mainline linux kernel does not support certain
> classes of stupidity.  TCP offload engines, firmware drivers for
> hardware we care about, a fixed ABI to binary only modules, etc.
> It is the responsibility of the OS to setup MSI so we do it, not
> the firmware so we do it.

I absolutely disagree with you Eric, and I think you're being
rediculious.

If the hypervisor doesn't control the MSI PCI config space
register writes, this allows the device to spam PCI devices
which belong to other domains.

It's a freakin' reasonable design trade off decision, get over
it! :-)

Yes it can be done at the hardware level, and many hypervisor
based systems do that, but it's not the one-and-only true
way to implment inter-domain protection behind a single
PCI host controller.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 20:47     ` Jeff Garzik
  2007-01-28 21:20       ` Eric W. Biederman
  2007-01-28 22:11       ` Eric W. Biederman
@ 2007-01-28 23:42       ` David Miller
  2 siblings, 0 replies; 35+ messages in thread
From: David Miller @ 2007-01-28 23:42 UTC (permalink / raw)
  To: jeff
  Cc: benh, ebiederm, greg, tony.luck, grundler, mingo, linux-kernel,
	kyle, linuxppc-dev, brice, shaohua.li, linux-pci

From: Jeff Garzik <jeff@garzik.org>
Date: Sun, 28 Jan 2007 15:47:24 -0500

> I think the high-level ops approach makes more sense.  It's more future 
> proof, in addition to covering all existing implementations.

I totally agree with this.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 21:20       ` Eric W. Biederman
  2007-01-28 21:26         ` Ingo Molnar
  2007-01-28 22:09         ` Benjamin Herrenschmidt
@ 2007-01-28 23:44         ` David Miller
  2 siblings, 0 replies; 35+ messages in thread
From: David Miller @ 2007-01-28 23:44 UTC (permalink / raw)
  To: ebiederm
  Cc: jeff, benh, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 14:20:12 -0700

> I see people pushing ridiculous interfaces like the RTAS hypervisor
> interface at me, and saying we must support running firmware drivers
> in the msi code.

This is not what's going on.

The hypervisor does the PCI config space programming on the
device to setup the MSI so that it can be done in a controlled
manner and such that the device cannot ever be configured by
one domain to shoot MSI packets over at devices which belong
to another domain.

It's that simple.

That's absolutely reasonable, and is I believe what you'll see the
sparc64 hypervisor(s) all needing as well.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:26           ` Eric W. Biederman
  2007-01-28 23:37             ` David Miller
@ 2007-01-29  1:33             ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  1:33 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Jeff Garzik, Greg Kroah-Hartman, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller


> To be clear I see this as 2 distinct layers of code. enable/disable
> that talks directly to the hardware, and the helpers of enable/disable
> that allocate the irq.  I base this on the fact that I only need the
> alloc/free when I am exclusively working with real hardware.

We need the alloc/free in all cases, wether we are talking to real HW or
hypervisor. Alloc free is what allocates linux virtual irq numbers (or
irq_desc's if your prefer) and what sets up the irq_desc->irq_chip to
the appropriate thing for MSIs on that machines. Thus it's really the
required step for everybody.

The thing you seem to be mixing up is allocating of linux virtual irqs
(picking an irq desc) and allocating of a HW vectors on your platformn
(which happens to be the same pretty much on x86 nowdays no ? That is,
they have the same numbering domain don't they ?).

That is, while in the HV case, we don't allocate HW vectors (the HV does
it for us), we still need to allocate linux irqs, setup the irq desc,
and hook them up.

> > You seem to absolutely want to get the HV case to go throuh the same
> > code path as the "raw" case, and that will not happen.
> 
> Yes I do.  Because that is the only sane approach for a HV to use.

BUT WE DON'T HAVE A CHOICE ON WHAT APPROACH THE HV USES !!!! pfff...
Isn't that clear enough ?

IBM will not change their HV interfaces becasue you don't like them, and
I doubt sun will neither and despite you disagreeing on that, we -do-
have to support them (hell, that's what I'm paid for among other
things :-)

It would be nice if we could dictate all HV and hardware vendors around
how we think they should work and what interfaces they should provide, I
suppose M$ can do that with Windows, but unfortunately, we aren't in a
position to do that.
 
> And yes we need an irq allocator to call the HV to setup the upstream
> reception of the msi message.

Not sure I completely parse the above.

> However I don't think it will be to hard to support your HV once we get
> the real hardware supported.  I just refuse to consider it before we have
> figured out what makes sense in the context where we have to do everything.

Hrmph....

> >   .../... (irq operations)
> >
> >> These because they are per irq make sense as per bus operations unless
> >> you have a good architecture definition like x86 has.  Roughly those
> >> operations are what we currently have except the current operations
> >> are a little simpler and easier to deal with for the architecture
> >> code.
> >
> > Oh ? How so ? (easier/simpler ?)
> 
> I don't take a type parameter, and I don't take a vector.  All of
> that work is done in the generic code.

Well, so basically, the main difference is that we make MSI looks like
MSI-X by providing an alloc/free abstraction that takes an array in all
cases and you make MSI-X look like MSI by working one interrupt at a
time.

Your case avoids a for () loop in the backend, at the cost of being
fundamentally incompatible with our HV approach (and possibly others
like sparc64).

We do pass the MSI vs. MSI-X because it's handy for the HV case to pass
it along to the firmware, though it doesn't have to be used, and indeed,
in the "raw" case, we don't use it.
 
> > This is indeed a lower level hook to be used by "raw" enable/disable. An
> > other approach would be to remove it, have each backend have it's own
> > enable/disable that obtains the address/data and calls into a helper to
> > program them. This would indeed be a little bit nicer in a layering
> > perspective. But it adds a bit more code to each backend, so we kept
> > things closer to the way they used to be. I don't have a firm reason not
> > to change it however, I need talk to Michael in case he has more good
> > reasons to keep it that way around. 
> 
> The current code in the kernel already is structured that way because
> we have to reprogram the msi message on each irq migration.  Not using
> a helper to write the message would be a noticeable change and require
> a fair amount of code rewriting on the currently supported
> architectures.

We never proposed not to use a helper to write back the message. We are
missing such a helper in the current implementation, true, but that
doesn't mean we are opposed to havign it, on the contrary.

However, I don't think your implementation is much cleaner :-) The thing
is that Michael's implementation completely avoids having any knowledge
of the specifics of enabling/disabling MSI's or MSI-X's in the top level
core code.

The main difference after the alloc/free case is the enable/disable
case:

You do something like that: Toplevel calls the backend "setup" for each
MSI or MSI-X, which itself then calls back into a helper to actually
write the message, that helper doing then a switch/case on MSI vs. MSI-X
based on infos in the msi desc. Then, you go back to the toplevel which
goes whack the config space to atually do the global enabling of MSIs or
MSI-X.

Well, I don't think that from a layering perspective, that is much
nicer. Your toplevel is a mix of high level interface to the backend and
low level code specific to the "raw" implementation.

In fact, I preferred the way it was done previously in that area in the
sense that if you decide to have the "raw" implementation indeed be the
"default" one, then move it at the top level and call some hook to
obtain the address/value pair for each MSI. That doesn't preclude having
the low level write_msi_message() function still be exported for use by
the set_affinity callback.

Michael's approach is similar than the above except that instead of
having the raw implementation at the toplevel, it hooks is via
enable/disable/setup_msi_msg.
 
> Yes.  In general the mainline linux kernel does not support certain
> classes of stupidity.
> TCP offload engines, firmware drivers for
> hardware we care about, a fixed ABI to binary only modules, etc.
> It is the responsibility of the OS to setup MSI so we do it, not
> the firmware so we do it.
> 
> Not supporting stupid things that are hard to support encourages other
> people not to be so silly, especially when linux still works on the
> hardware when that silly feature isn't supported.

Not supporting IBM HV because of those idealistic reasons means not
supporting a whole range of IBM machines in linux since LSIs are
optional on PCI-E.

It's not just a performance difference. A whole set of hardware will
-not- work on those machines because somebody has an ideal view of the
world (heh, that's funny, that same person actively works on x86
support, damn, that's something less than ideal :-)

I think that's a bit of a lame attitude (without wanting to be
insulting). The same way we can't dictate HW vendors how to do their
stuff (we try to encourage them ,we try to teach them, but once the HW
is out and people use it, we do also try to actually support it). So
yes, we try to "fix" some of our HV stuff when we think it's too much
off the hook (for example, initial interfaces didn't allow to
differenciate MSI from MSI-X at all, we got that changed) but there's a
limit on our influence on these things (heh, they also have to support
other operating systems) and we can't just say "won't support you" when
the interfaces don't please us.

> For similar reasons we don't support more than 1 irq with a plain MSI
> capability.  

I never understood why we had this stupid limitation in our API. It
would have been easy enough to do an API that can support it, as long as
we properly define that the platform is allowed to give you less than
what you asked.

> It is hard

Not really... Heh, in fact, with those "stupid" hypervisor interfaces,
it's actually very easy :-) But even in the raw case. Really not that
hard. Easier than MSI-X in many ways.

> we can't do it on most hardware

I've seen quite a few cards who say they do more than 1 MSI and the host
hardware shouldn't matter in that area.

> and anyone who wants more than 1 irq should just implement MSI-X and everyone
> will be able to use it, on any hardware.

Sure, anyone should just implement their hardware the way the linux
folks tell them to do, too bad HW vendors don't worship us as gods and
don't take our rules as god send laws ...

> Part of the reason to not support a messed up HV interface if it hard
> is that a HV is just software.  Which means the incremental cost to
> fix it is roughly the same as fixing the linux kernel, and it puts
> the burden on the people doing stupid things not on the rest of us
> forever more.

The comparison between > 1 MSI and HV is bad here. Supporting only 1 MSI
actually still allows the HW to work. Not supporting the HV (and thus
not supporting MSIs on those machines) does not when you start hitting
hardware that doesn't do LSI (which is allowed by spec and is starting
to appear, some IB cards for example don't do LSI).

> No.  I have spent time fixing what is there, and made it work.  I see
> implementations proposed that don't handle cases I have fixed, and I
> don't see anything that resembles a simple migration path for i386,
> x86_64 and ia64.  Which is part of what annoys me when I am told
> the ops work for everything.

They potentially do, and the easy migration path is mostly to use the
existing code as a HV-type backend for x86, and then incrementally fix
our generic "raw" helpers and move bits of the x86 / old-generic-code to
it... 

That's also a nice incremental approach...
 
> As for the code not working important parts of the code (like MSI-X)
> don't even work on ppc. 
> The strength of my opposition is largely
> shaped by the number of people wearing rose colored glasses and
> ignoring the problems, and missing huge details.

Well, which is why I'd like to have a more constructive discussion on
how to address those rather than outright dismissing the approach. You
are using the fact that Michael's implementation isn't feature complete
as an argument to dismiss the entire approach. In a way, you are
commiting a layering violation in your argument :-)

However, we can't get that resolved if we still don't agree on the veric
basic premises of the direction we are taking. We need to agree on some
of the fundamentals (like having alloc/free take an array or be
per-interrupt) or agree to disagree in which case we have no choice on
our side to "finish" Michael's implementation to do all those bits it's
missing and propose it as an alternate since the main one will not
handle our needs.
 
> Given that we have been talking about things since before OLS I would
> have expected the ppc code to be a little farther along.

We have been delayed / side tracked with other things. Shit happens.

> Not the x86 backend but the raw backend.  You might not need all of
> the features because you are always going through another interrupt
> controller but that doesn't mean they shouldn't be there.

I never disagreed with that. I always said we should have most of the
missing bits added to the raw backend for x86.

> Michael has at least agreed to look in that direction so I'm hoping
> my changes remove some of the difficulty for him.

Some do, some makes it more difficult. The way you removed the
alloc/free vs. setup/teardown distinction and made the whole thing
per-interrupt makes things more difficult for us.

> Nor do I see the level of care being put into the problem that would
> cause me to trust a rewrite.  I have a huge number of little technical
> problems with the proposed code, and see absolutely no overriding
> virtue in it.  Especially when the worst of the problems with msi.c
> can be easily fixed, as demonstrated by my patchset.

Can be fixed in a way that is by design incompatible with what we need. 

How should I phrase this so you understand that's not an option for
us ? 

Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 23:37             ` David Miller
@ 2007-01-29  5:18               ` Eric W. Biederman
  2007-01-29  5:25                 ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-29  5:18 UTC (permalink / raw)
  To: David Miller
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 16:26:44 -0700
>
>> Yes.  In general the mainline linux kernel does not support certain
>> classes of stupidity.  TCP offload engines, firmware drivers for
>> hardware we care about, a fixed ABI to binary only modules, etc.
>> It is the responsibility of the OS to setup MSI so we do it, not
>> the firmware so we do it.
>
> I absolutely disagree with you Eric, and I think you're being
> rediculious.
>
> If the hypervisor doesn't control the MSI PCI config space
> register writes, this allows the device to spam PCI devices
> which belong to other domains.
>
> It's a freakin' reasonable design trade off decision, get over
> it! :-)

I completely agree with you in the case you have described, it does
mean that the hypervisor needs to trust all of the MSI capable
hardware in the system but it if that is the best your hardware can
support it is a reasonable trade-off.

With the MSI-X registers in a random part of some memory mapped bar
and not guaranteed to be page aligned, things are more difficult to
isolate purely in a software based hypervisor.

> Yes it can be done at the hardware level, and many hypervisor
> based systems do that, but it's not the one-and-only true
> way to implment inter-domain protection behind a single
> PCI host controller.

The reason I consider the case crazy is that every example I have
been given is where the hardware is doing the filtering above the
PCI device.  So the hypervisor has no need to filter the pci config
traffic or to write to the msi config registers for us.  Yet the
defined hypervisor interface is.  Given the reduction in flexibility
of an interface where the hypervisor writes to the config registers
for the OS as compared to an interface where the hypervisor provides
a destination for MSI messages from a particular device upon request,
I think it is silly to design an interface when you full hardware
support to act like an interface built for a hypervisor that had
to do everything in software.

Regardless of my opinion on the sanity of the hypervisor architects.
I have not seen anything that indicates it will be hard to support
the hypervisor doing everything or most of everything for us, so
I see no valid technical objection to it.  Nor have I ever.

So I have no problem with additional patches in that direction.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:18               ` Eric W. Biederman
@ 2007-01-29  5:25                 ` David Miller
  2007-01-29  5:58                   ` Eric W. Biederman
  2007-01-29  6:05                   ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 35+ messages in thread
From: David Miller @ 2007-01-29  5:25 UTC (permalink / raw)
  To: ebiederm
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

From: ebiederm@xmission.com (Eric W. Biederman)
Date: Sun, 28 Jan 2007 22:18:59 -0700

> Regardless of my opinion on the sanity of the hypervisor architects.
> I have not seen anything that indicates it will be hard to support
> the hypervisor doing everything or most of everything for us, so
> I see no valid technical objection to it.  Nor have I ever.
> 
> So I have no problem with additional patches in that direction.

Ok, that's great to hear.

I know your bi-directional approach isn't exactly what Ben
wants but he can support his machines with it.  Maybe after
some time we can agree to move from that more towards the
totally abstracted scheme.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:25                 ` David Miller
@ 2007-01-29  5:58                   ` Eric W. Biederman
  2007-01-29  6:05                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-29  5:58 UTC (permalink / raw)
  To: David Miller
  Cc: benh, jeff, greg, tony.luck, grundler, mingo, linux-kernel, kyle,
	linuxppc-dev, brice, shaohua.li, linux-pci

David Miller <davem@davemloft.net> writes:

> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 22:18:59 -0700
>
>> Regardless of my opinion on the sanity of the hypervisor architects.
>> I have not seen anything that indicates it will be hard to support
>> the hypervisor doing everything or most of everything for us, so
>> I see no valid technical objection to it.  Nor have I ever.
>> 
>> So I have no problem with additional patches in that direction.
>
> Ok, that's great to hear.
>
> I know your bi-directional approach isn't exactly what Ben
> wants but he can support his machines with it.  Maybe after
> some time we can agree to move from that more towards the
> totally abstracted scheme.

Moving farther has been my intention the entire time, even
while writing those patches.  I'm just not prepared to do it in
one giant patch where bug hunting becomes impossible.

I think I have moved msi.c to the point it won't be a horror to
work with, so we can start seriously looking at what it will
take to support hypervisors that do this.

I don't believe there is anything generic we can do in the general
hypervisor case, so we need a way for the architecture code in
the case where it is inappropriate to write directly to the msi
and msi-x registers to have a completely different implementation of:
pci_enable_msi, pci_disable_msi, pci_enable_msix, psi_disable_msix,
and whatever other driver interface bits we have in there.

One small step at a time and we should get there soon.

Eric


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  5:25                 ` David Miller
  2007-01-29  5:58                   ` Eric W. Biederman
@ 2007-01-29  6:05                   ` Benjamin Herrenschmidt
  2007-01-29  8:28                     ` Eric W. Biederman
  2007-01-29  9:03                     ` Eric W. Biederman
  1 sibling, 2 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29  6:05 UTC (permalink / raw)
  To: David Miller
  Cc: ebiederm, jeff, greg, tony.luck, grundler, mingo, linux-kernel,
	kyle, linuxppc-dev, brice, shaohua.li, linux-pci

On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
> From: ebiederm@xmission.com (Eric W. Biederman)
> Date: Sun, 28 Jan 2007 22:18:59 -0700
> 
> > Regardless of my opinion on the sanity of the hypervisor architects.
> > I have not seen anything that indicates it will be hard to support
> > the hypervisor doing everything or most of everything for us, so
> > I see no valid technical objection to it.  Nor have I ever.
> > 
> > So I have no problem with additional patches in that direction.
> 
> Ok, that's great to hear.
> 
> I know your bi-directional approach isn't exactly what Ben
> wants but he can support his machines with it.  Maybe after
> some time we can agree to move from that more towards the
> totally abstracted scheme.

It can support my machines without HV with trivial changes I reckon: I
need an ops struct to indirect eric's 2 remaining arch hooks
(setup/teardown) but that can be done inline within asm-powerpc. I need
to double check of course and probably actually port the MPIC backend
and possibly go write the Cell Axon one while at it to verify everything
is allright, but the base design seems sound enough.

For the ones with HV (RTAS stuff), we still need to agree on how to
approach it. We can either:

Option 1
--------

Do a hook -above- Eric stuff, by having the toplevel APIs themselves be
arch hooks that can either go toward the RTAS implementation or toward
Eric's code. That is, eric code would define those (pick better names if
you are good at it):

	pci_generic_enable_msi
	pci_generic_disable_msi
	pci_generic_enable_msix
	pci_generic_disable_msix
	pci_generic_save_msi_state
	pci_generic_restore_msi_state

Then we can have asm-i386/msi.h & friends do something like

#define pci_enable_msi	pci_generic_enable_msi
#define pci_disable_msi	pci_generic_disable_msi
   etc...

And we can have asm-powerpc/msi.h hook then via ppc_md:

static inline int pci_enable_msi(xxx...)
{
	return ppc_md.pci_enable_msi(xxx...);
}
etc...

(ppc_md is our per-platform global hook structure filled at boot when we
discover on what machine type we are running on) so that pSeries can use
it's own RTAS callbacks, and others can just re-hook those to Eric's
code.


Option 2
--------

That is to make Eric's code itself cope with the HV case. I'm a bit at
loss right now as how precisely to do it. I need to spend more time
staring at the code after Eric latest patches rather than the patches
themselves I suppose :-) (Eric, they don't apply out of the box on
current git, they are against -mm ?).

Some of the main issues here, more/less following the order in which
Eric code calls things:

 - The number of vectors for MSI-X is obtained from config space (at
least for sanity checking the requested argument). On RTAS, it should
come from an OF property (we are really not supposed to go read the
config space even if we can). I -suppose- we can survive for now with
just reading it, but we might well run into trouble with some "special"
devices shared accross partitions or if the IBM magic bridges themselves
ever start sending MSI-X on their own (unlikely but who knows...).
Michael's code handled that by having a callback ->check() do the sanity
checking of the nvec, and then just use the nvec passed in as an
argument once it's sane.

So for that I would propose adding an arch_check_msi(pdev, type, nvec)
or something like that. Note the biggest issue at this point anyway.

 - The real big one: For MSI-X, Eric's code tries to "hide" the fact
that those are MSI-X by allocating the msi-x entry array, then iterating
through them calling arch_setup_msi_irq() for each of them.

For that to work for us, it would need to be different, possibly
pre-allocating the array, and having -one- call taking an array and a
nvec. That's one of the reasons why I liked Michael's approach as
instead of making MSI-X look like MSI, it made MSI look like MSI-X by
passing a 1 entry array in the MSI case. Both approaches can probably be
made to handle multiple MSIs if we ever want to handle them.

The same issue is present for teardown of course.

 - We need HV hooks for suspend/resume at one point. Nothing urgent
there as our HV machines don't do suspend/resume just yet :-) But if we
ever implement something like suspend-to-disk, they'll definitely need
something as we are likely to get different vectors back from the
firmware so we need to re-map them to the same linux IRQ numbers.

I need to have a second look at Eric's code after I manage to find the
right combination of kernel for his patches to apply to check if I
missed anything important.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  6:05                   ` Benjamin Herrenschmidt
@ 2007-01-29  8:28                     ` Eric W. Biederman
  2007-01-29  9:03                     ` Eric W. Biederman
  1 sibling, 0 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-29  8:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, ebiederm, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci

>
> That is to make Eric's code itself cope with the HV case. I'm a bit at
> loss right now as how precisely to do it. I need to spend more time
> staring at the code after Eric latest patches rather than the patches
> themselves I suppose :-) (Eric, they don't apply out of the box on
> current git, they are against -mm ?).

Current git + gregkh-pci (Which has a couple of Michaels patches).
With current git the only problem should be context around msi_lookup_irq
which changes between the two.  But in this case the context around
an entire function being deleted doesn't matter.

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  6:05                   ` Benjamin Herrenschmidt
  2007-01-29  8:28                     ` Eric W. Biederman
@ 2007-01-29  9:03                     ` Eric W. Biederman
  2007-01-29 10:11                       ` Michael Ellerman
  2007-01-29 20:22                       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 35+ messages in thread
From: Eric W. Biederman @ 2007-01-29  9:03 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci

Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:

> On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
>> From: ebiederm@xmission.com (Eric W. Biederman)
>> Date: Sun, 28 Jan 2007 22:18:59 -0700
>> 
>> > Regardless of my opinion on the sanity of the hypervisor architects.
>> > I have not seen anything that indicates it will be hard to support
>> > the hypervisor doing everything or most of everything for us, so
>> > I see no valid technical objection to it.  Nor have I ever.
>> > 
>> > So I have no problem with additional patches in that direction.
>> 
>> Ok, that's great to hear.
>> 
>> I know your bi-directional approach isn't exactly what Ben
>> wants but he can support his machines with it.  Maybe after
>> some time we can agree to move from that more towards the
>> totally abstracted scheme.
>
> It can support my machines without HV with trivial changes I reckon: I
> need an ops struct to indirect eric's 2 remaining arch hooks
> (setup/teardown) but that can be done inline within asm-powerpc. I need
> to double check of course and probably actually port the MPIC backend
> and possibly go write the Cell Axon one while at it to verify everything
> is allright, but the base design seems sound enough.
>
> For the ones with HV (RTAS stuff), we still need to agree on how to
> approach it. We can either:
>
> Option 1
> --------
>
> Do a hook -above- Eric stuff, by having the toplevel APIs themselves be
> arch hooks that can either go toward the RTAS implementation or toward
> Eric's code. That is, eric code would define those (pick better names if
> you are good at it):
>
> 	pci_generic_enable_msi
> 	pci_generic_disable_msi
> 	pci_generic_enable_msix
> 	pci_generic_disable_msix
> 	pci_generic_save_msi_state
> 	pci_generic_restore_msi_state
>
> Then we can have asm-i386/msi.h & friends do something like
>
> #define pci_enable_msi	pci_generic_enable_msi
> #define pci_disable_msi	pci_generic_disable_msi
>    etc...
>
> And we can have asm-powerpc/msi.h hook then via ppc_md:
>
> static inline int pci_enable_msi(xxx...)
> {
> 	return ppc_md.pci_enable_msi(xxx...);
> }
> etc...
>
> (ppc_md is our per-platform global hook structure filled at boot when we
> discover on what machine type we are running on) so that pSeries can use
> it's own RTAS callbacks, and others can just re-hook those to Eric's
> code.

This is the most straight forward and handles machines with really
weird msi setups, so I lean in this direction.

The question is there anything at all we can do generically?

I can't see a case where ppc_md would not wind up with the hooks
that decide if it is a hypervisor or not.  Even if we came up
with a better set of functions you need to hook.

> Option 2
> --------
>
> That is to make Eric's code itself cope with the HV case. I'm a bit at
> loss right now as how precisely to do it. I need to spend more time
> staring at the code after Eric latest patches rather than the patches
> themselves I suppose :-) (Eric, they don't apply out of the box on
> current git, they are against -mm ?).
>
> Some of the main issues here, more/less following the order in which
> Eric code calls things:
>
>  - The number of vectors for MSI-X is obtained from config space (at
> least for sanity checking the requested argument). On RTAS, it should
> come from an OF property (we are really not supposed to go read the
> config space even if we can). I -suppose- we can survive for now with
> just reading it, but we might well run into trouble with some "special"
> devices shared accross partitions or if the IBM magic bridges themselves
> ever start sending MSI-X on their own (unlikely but who knows...).
> Michael's code handled that by having a callback ->check() do the sanity
> checking of the nvec, and then just use the nvec passed in as an
> argument once it's sane.

Ok. I think I get the point of check.  I believe I need to look at your
code a little more and see what you are doing to see if there is anything
generic worth doing, that we can always do outside of architecture code
no matter how much of the job the Hypervisor wants to do for us.

I'd hate to hit a different Hypervisor that did something close but
not quite the same and have the code fail then.  So definitely
avoiding touching pci config space at all in the calls seems to make a
lot of sense.  This includes avoiding pci_find_capability right?

Off the top of my head the only things we can do generically are
some data structure things and flags like dev->msi_enabled or
dev->msix_enabled.

Anyway have a nice night and more in the morning.


Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  9:03                     ` Eric W. Biederman
@ 2007-01-29 10:11                       ` Michael Ellerman
  2007-01-29 20:32                         ` Benjamin Herrenschmidt
  2007-01-29 23:29                         ` Paul Mackerras
  2007-01-29 20:22                       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 35+ messages in thread
From: Michael Ellerman @ 2007-01-29 10:11 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Benjamin Herrenschmidt, David Miller, jeff, greg, tony.luck,
	grundler, mingo, linux-kernel, kyle, linuxppc-dev, brice,
	shaohua.li, linux-pci

[-- Attachment #1: Type: text/plain, Size: 6444 bytes --]

On Mon, 2007-01-29 at 02:03 -0700, Eric W. Biederman wrote:
> Benjamin Herrenschmidt <benh@kernel.crashing.org> writes:
> 
> > On Sun, 2007-01-28 at 21:25 -0800, David Miller wrote:
> >> From: ebiederm@xmission.com (Eric W. Biederman)
> >> Date: Sun, 28 Jan 2007 22:18:59 -0700
> >> 
> >> > Regardless of my opinion on the sanity of the hypervisor architects.
> >> > I have not seen anything that indicates it will be hard to support
> >> > the hypervisor doing everything or most of everything for us, so
> >> > I see no valid technical objection to it.  Nor have I ever.
> >> > 
> >> > So I have no problem with additional patches in that direction.
> >> 
> >> Ok, that's great to hear.
> >> 
> >> I know your bi-directional approach isn't exactly what Ben
> >> wants but he can support his machines with it.  Maybe after
> >> some time we can agree to move from that more towards the
> >> totally abstracted scheme.
> >
> > It can support my machines without HV with trivial changes I reckon: I
> > need an ops struct to indirect eric's 2 remaining arch hooks
> > (setup/teardown) but that can be done inline within asm-powerpc. I need
> > to double check of course and probably actually port the MPIC backend
> > and possibly go write the Cell Axon one while at it to verify everything
> > is allright, but the base design seems sound enough.
> >
> > For the ones with HV (RTAS stuff), we still need to agree on how to
> > approach it. We can either:
> >
> > Option 1
> > --------
> >
> > Do a hook -above- Eric stuff, by having the toplevel APIs themselves be
> > arch hooks that can either go toward the RTAS implementation or toward
> > Eric's code. That is, eric code would define those (pick better names if
> > you are good at it):
> >
> > 	pci_generic_enable_msi
> > 	pci_generic_disable_msi
> > 	pci_generic_enable_msix
> > 	pci_generic_disable_msix
> > 	pci_generic_save_msi_state
> > 	pci_generic_restore_msi_state
> >
> > Then we can have asm-i386/msi.h & friends do something like
> >
> > #define pci_enable_msi	pci_generic_enable_msi
> > #define pci_disable_msi	pci_generic_disable_msi
> >    etc...
> >
> > And we can have asm-powerpc/msi.h hook then via ppc_md:
> >
> > static inline int pci_enable_msi(xxx...)
> > {
> > 	return ppc_md.pci_enable_msi(xxx...);
> > }
> > etc...
> >
> > (ppc_md is our per-platform global hook structure filled at boot when we
> > discover on what machine type we are running on) so that pSeries can use
> > it's own RTAS callbacks, and others can just re-hook those to Eric's
> > code.
> 
> This is the most straight forward and handles machines with really
> weird msi setups, so I lean in this direction.
> 
> The question is there anything at all we can do generically?
> 
> I can't see a case where ppc_md would not wind up with the hooks
> that decide if it is a hypervisor or not.  Even if we came up
> with a better set of functions you need to hook.
> 
> > Option 2
> > --------
> >
> > That is to make Eric's code itself cope with the HV case. I'm a bit at
> > loss right now as how precisely to do it. I need to spend more time
> > staring at the code after Eric latest patches rather than the patches
> > themselves I suppose :-) (Eric, they don't apply out of the box on
> > current git, they are against -mm ?).
> >
> > Some of the main issues here, more/less following the order in which
> > Eric code calls things:
> >
> >  - The number of vectors for MSI-X is obtained from config space (at
> > least for sanity checking the requested argument). On RTAS, it should
> > come from an OF property (we are really not supposed to go read the
> > config space even if we can). I -suppose- we can survive for now with
> > just reading it, but we might well run into trouble with some "special"
> > devices shared accross partitions or if the IBM magic bridges themselves
> > ever start sending MSI-X on their own (unlikely but who knows...).
> > Michael's code handled that by having a callback ->check() do the sanity
> > checking of the nvec, and then just use the nvec passed in as an
> > argument once it's sane.
> 
> Ok. I think I get the point of check.  I believe I need to look at your
> code a little more and see what you are doing to see if there is anything
> generic worth doing, that we can always do outside of architecture code
> no matter how much of the job the Hypervisor wants to do for us.
> 
> I'd hate to hit a different Hypervisor that did something close but
> not quite the same and have the code fail then.  So definitely
> avoiding touching pci config space at all in the calls seems to make a
> lot of sense.  This includes avoiding pci_find_capability right?

You can read config space, but it's not clear to me if the HV is allowed
to filter it and hide things. It's also possible that the device
supports MSI, but for some reason the HV doesn't allow it on that device
etc. so you really have to ask the HV if it's enabled. So pci_find_cap()
shouldn't crash or anything, but it may lie to you.

> Off the top of my head the only things we can do generically are
> some data structure things and flags like dev->msi_enabled or
> dev->msix_enabled.

It would be good to have a common data structure if possible. My
thinking was that most of the information is per pci_dev, so that's
where I put it. I realise the Intel code stores some info that's
per-irq, but most of it is per-device. I hadn't got anywhere near coding
it, but my vague idea was to add a arch_data (or whatever) pointer to my
msi_info struct, which would allow backends to stash stuff.

I think the pci_intx() calls can be in the core.

Munging dev->irq could be in the core, assuming it's left in some known
location by the code. On the other hand we might want to decide it's a
bad idea altogether.

One thing I did like about my code, is that pci_enable_msi() and
pci_enable_msix() are just small wrappers around generic_enable_msi() -
which does all the work, and is the same regardless of whether it's an
MSI or MSI-X. Although that's facilitated by the type arg which you
don't like.

cheers

-- 
Michael Ellerman
OzLabs, IBM Australia Development Lab

wwweb: http://michael.ellerman.id.au
phone: +61 2 6212 1183 (tie line 70 21183)

We do not inherit the earth from our ancestors,
we borrow it from our children. - S.M.A.R.T Person

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29  9:03                     ` Eric W. Biederman
  2007-01-29 10:11                       ` Michael Ellerman
@ 2007-01-29 20:22                       ` Benjamin Herrenschmidt
  2007-01-29 23:05                         ` Paul Mackerras
  1 sibling, 1 reply; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 20:22 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: David Miller, jeff, greg, tony.luck, grundler, mingo,
	linux-kernel, kyle, linuxppc-dev, brice, shaohua.li, linux-pci


> This is the most straight forward and handles machines with really
> weird msi setups, so I lean in this direction.
> 
> The question is there anything at all we can do generically?
> 
> I can't see a case where ppc_md would not wind up with the hooks
> that decide if it is a hypervisor or not.  Even if we came up
> with a better set of functions you need to hook.

Sure, but with Michael's approach, the only hook was get_msi_ops(pdev) 

Anyway, there isn't -that- much that can be done generically in the HV
case. Mostly some argument sanity checking, the logic for saving &
restoring pdev->irq for MSIs, that sort of thing.

> Ok. I think I get the point of check.  I believe I need to look at your
> code a little more and see what you are doing to see if there is anything
> generic worth doing, that we can always do outside of architecture code
> no matter how much of the job the Hypervisor wants to do for us.

I understand.

> I'd hate to hit a different Hypervisor that did something close but
> not quite the same and have the code fail then.  So definitely
> avoiding touching pci config space at all in the calls seems to make a
> lot of sense.  This includes avoiding pci_find_capability right?

Quite possibly yes. I'm pretty sure it will work on IBM HV but we aren't
really supposed to use it...

> Off the top of my head the only things we can do generically are
> some data structure things and flags like dev->msi_enabled or
> dev->msix_enabled.

That and the saving & restoring of pdev->irq. That is not very much.

> Anyway have a nice night and more in the morning.

Ben.


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 10:11                       ` Michael Ellerman
@ 2007-01-29 20:32                         ` Benjamin Herrenschmidt
  2007-01-29 23:29                         ` Paul Mackerras
  1 sibling, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 20:32 UTC (permalink / raw)
  To: michael
  Cc: Eric W. Biederman, David Miller, jeff, greg, tony.luck, grundler,
	mingo, linux-kernel, kyle, linuxppc-dev, brice, shaohua.li,
	linux-pci


> You can read config space, but it's not clear to me if the HV is allowed
> to filter it and hide things. 

I've seen it do it for example with EADS bridges. I haven't seen doing
it with devices (other than hiding entire functions) but I wouldn't
exclude it...

> It's also possible that the device
> supports MSI, but for some reason the HV doesn't allow it on that device
> etc. so you really have to ask the HV if it's enabled. So pci_find_cap()
> shouldn't crash or anything, but it may lie to you.

Yup.

> One thing I did like about my code, is that pci_enable_msi() and
> pci_enable_msix() are just small wrappers around generic_enable_msi() -
> which does all the work, and is the same regardless of whether it's an
> MSI or MSI-X. Although that's facilitated by the type arg which you
> don't like.

Part of the reason is you make MSI look like MSI-X (a vector of 1 entry)
while Eric does the opposite.

Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 20:22                       ` Benjamin Herrenschmidt
@ 2007-01-29 23:05                         ` Paul Mackerras
  2007-01-30 19:32                           ` Segher Boessenkool
  0 siblings, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2007-01-29 23:05 UTC (permalink / raw)
  To: Eric W. Biederman, Benjamin Herrenschmidt
  Cc: tony.luck, grundler, jeff, greg, linux-kernel, kyle,
	linuxppc-dev, linux-pci, brice, shaohua.li, mingo, David Miller

Benjamin Herrenschmidt writes:

> > I'd hate to hit a different Hypervisor that did something close but
> > not quite the same and have the code fail then.  So definitely
> > avoiding touching pci config space at all in the calls seems to make a
> > lot of sense.  This includes avoiding pci_find_capability right?
> 
> Quite possibly yes. I'm pretty sure it will work on IBM HV but we aren't
> really supposed to use it...

Actually, I don't know of any reason why we can't use
pci_find_capability.  We are supposed to avoid trying to touch config
space of devices (in fact, functions) that aren't assigned to our
partition, but we're not talking about that here.

I just got an answer from the hypervisor architects.  It turns out
that the hardware _does_ prevent the device from sending MSI messages
to another partition.  The OS _can_ write whatever it likes to the MSI
address and data registers.  It can potentially lose interrupts (or, I
expect, get the device isolated by EEH) but it can't disrupt another
partition.

I think the reason why the hypervisor call writes the values straight
into the MSI/MSI-X registers in the device is (a) that's convenient
for AIX, since it saves it from immediately having to do more calls
into the hypervisor to write those values to the device, and (b) there
are some ABI complications in returning a lot of values, so the device
registers provide a convenient place to return those values.

So it would be possible, although gross, to do the hypervisor call,
read the values from config space and return them to the generic code,
then let the generic code write them to config space for us. :P

The remaining point of difference then seems to be that for MSI-X, we
really want to know up-front how many interrupts the device driver is
asking for, rather than having a series of alloc requests dribble in
one at a time.

Regards,
Paul.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 10:11                       ` Michael Ellerman
  2007-01-29 20:32                         ` Benjamin Herrenschmidt
@ 2007-01-29 23:29                         ` Paul Mackerras
  2007-01-29 23:40                           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 35+ messages in thread
From: Paul Mackerras @ 2007-01-29 23:29 UTC (permalink / raw)
  To: michael
  Cc: Eric W. Biederman, tony.luck, grundler, jeff, linux-kernel, kyle,
	linuxppc-dev, linux-pci, brice, greg, shaohua.li, mingo,
	David Miller

Michael Ellerman writes:

> You can read config space, but it's not clear to me if the HV is allowed
> to filter it and hide things. It's also possible that the device

It appears that the HV does not prevent us from reading or writing any
config space registers for functions that are assigned to us.

> supports MSI, but for some reason the HV doesn't allow it on that device
> etc. so you really have to ask the HV if it's enabled. So pci_find_cap()
> shouldn't crash or anything, but it may lie to you.

It's possible that the device can do MSI(X), but that using MSI(X)
requires other platform resources (e.g. interrupt source numbers) and
there are none free.  I believe the platform guarantees a minimum
number of MSI(X) interrupts per function, but a pci_enable_msix() call
may not be able to give the driver as many MSI-X interrupts as it is
requesting even if the function can handle that many.

Paul.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 23:29                         ` Paul Mackerras
@ 2007-01-29 23:40                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 35+ messages in thread
From: Benjamin Herrenschmidt @ 2007-01-29 23:40 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: michael, tony.luck, grundler, jeff, David Miller, greg,
	linux-kernel, kyle, linuxppc-dev, Eric W. Biederman, shaohua.li,
	linux-pci, mingo, brice


> It's possible that the device can do MSI(X), but that using MSI(X)
> requires other platform resources (e.g. interrupt source numbers) and
> there are none free.  I believe the platform guarantees a minimum
> number of MSI(X) interrupts per function, but a pci_enable_msix() call
> may not be able to give the driver as many MSI-X interrupts as it is
> requesting even if the function can handle that many.

However, the ibm,req#msi(-x) properties contain the number as requested
by the device, and thus I expect them to be identical to the config
space value. So if you are confident enough that our HV won't play any
tricks there in the future, reading the config space is as good as
hooking that check() callback, though it might not be vs. some other HV
for some other platform that might be more strict.

We cannot know in advance how much max the HV will give us without
actually trying ibm,change-msi and see the result code for it
unfortunately.

Ben.



^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-29 23:05                         ` Paul Mackerras
@ 2007-01-30 19:32                           ` Segher Boessenkool
  0 siblings, 0 replies; 35+ messages in thread
From: Segher Boessenkool @ 2007-01-30 19:32 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: Eric W. Biederman, Benjamin Herrenschmidt, tony.luck, grundler,
	jeff, David Miller, greg, linux-kernel, kyle, linuxppc-dev,
	brice, shaohua.li, linux-pci, mingo

> I just got an answer from the hypervisor architects.  It turns out
> that the hardware _does_ prevent the device from sending MSI messages
> to another partition.  The OS _can_ write whatever it likes to the MSI
> address and data registers.  It can potentially lose interrupts (or, I
> expect, get the device isolated by EEH) but it can't disrupt another
> partition.

The OS however has to write the values the HV wants to
the device, or things won't work -- so the HV can just
as well do it itself.  Also, pulling all the work into
the HV makes for a cleaner, more generic design (who
knows what hardware will show up within the next few
years, the HV interface had better be prepared).


Segher


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH 0/6] MSI portability cleanups
  2007-01-28 22:09         ` Benjamin Herrenschmidt
  2007-01-28 23:26           ` Eric W. Biederman
@ 2007-02-01  4:29           ` Greg KH
  1 sibling, 0 replies; 35+ messages in thread
From: Greg KH @ 2007-02-01  4:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Eric W. Biederman, Jeff Garzik, Tony Luck, Grant Grundler,
	Ingo Molnar, linux-kernel, Kyle McMartin, linuxppc-dev,
	Brice Goglin, shaohua.li, linux-pci, David S. Miller

On Mon, Jan 29, 2007 at 09:09:14AM +1100, Benjamin Herrenschmidt wrote:
> 
> If we followed that "only do incrementental changes" rule all the time,
> imagine in what state would be our USB stack today since we couldn't
> have dropped in Linus replacement one ...

Bad example, that is not what happened at all.  There was not an
in-kernel USB stack when Linus wrote his.  Inaky had his
all-singing-all-dancing stack outside of the tree, and no one was really
helping out with it.

Only when Linus added his code to mainline did we all jump on it and
_incrementally_ improve it to what we have today.

So, in a way, you just proved that we need to do this in an incremental
fashion, which is what I was also saying all along :)

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2007-02-01  4:30 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1169714047.65693.647693675533.qpush@cradle>
2007-01-28 19:40 ` [PATCH 0/6] MSI portability cleanups Eric W. Biederman
2007-01-28 19:42   ` [PATCH 1/6] msi: Kill msi_lookup_irq Eric W. Biederman
2007-01-28 19:44     ` [PATCH 2/6] msi: Remove msi_lock Eric W. Biederman
2007-01-28 19:45       ` [PATCH 3/6] msi: Fix msi_remove_pci_irq_vectors Eric W. Biederman
2007-01-28 19:47         ` [PATCH 4/6] msi: Remove attach_msi_entry Eric W. Biederman
2007-01-28 19:52           ` [PATCH 5/6] msi: Kill the msi_desc array Eric W. Biederman
2007-01-28 19:56             ` [PATCH 6/6] msi: Make MSI useable more architectures Eric W. Biederman
2007-01-28 22:01     ` [PATCH 1/6] msi: Kill msi_lookup_irq Paul Mackerras
2007-01-28 22:18       ` Eric W. Biederman
2007-01-28 20:23   ` [PATCH 0/6] MSI portability cleanups Benjamin Herrenschmidt
2007-01-28 20:47     ` Jeff Garzik
2007-01-28 21:20       ` Eric W. Biederman
2007-01-28 21:26         ` Ingo Molnar
2007-01-28 22:09         ` Benjamin Herrenschmidt
2007-01-28 23:26           ` Eric W. Biederman
2007-01-28 23:37             ` David Miller
2007-01-29  5:18               ` Eric W. Biederman
2007-01-29  5:25                 ` David Miller
2007-01-29  5:58                   ` Eric W. Biederman
2007-01-29  6:05                   ` Benjamin Herrenschmidt
2007-01-29  8:28                     ` Eric W. Biederman
2007-01-29  9:03                     ` Eric W. Biederman
2007-01-29 10:11                       ` Michael Ellerman
2007-01-29 20:32                         ` Benjamin Herrenschmidt
2007-01-29 23:29                         ` Paul Mackerras
2007-01-29 23:40                           ` Benjamin Herrenschmidt
2007-01-29 20:22                       ` Benjamin Herrenschmidt
2007-01-29 23:05                         ` Paul Mackerras
2007-01-30 19:32                           ` Segher Boessenkool
2007-01-29  1:33             ` Benjamin Herrenschmidt
2007-02-01  4:29           ` Greg KH
2007-01-28 23:44         ` David Miller
2007-01-28 22:11       ` Eric W. Biederman
2007-01-28 23:42       ` David Miller
2007-01-28 21:34     ` Eric W. Biederman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).