* [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
@ 2011-10-03 14:50 Jon Mason
  2011-10-03 20:56 ` Linus Torvalds
  2011-10-03 21:55 ` Jon Mason
  0 siblings, 2 replies; 20+ messages in thread
From: Jon Mason @ 2011-10-03 14:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	Benjamin Herrenschmidt, linux-kernel, linux-pci

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

When configuring the PCIe settings for "performance", we allow parents
to have a larger Max Payload Size than children and rely on the
children's Max Read Request Size being no larger than their own MPS to
avoid having the host bridge generate responses they can't cope with.

However, various drivers in Linux call pcie_set_readrq() with arbitrary
values, assuming this to be a simple performance tweak. This breaks
under our "performance" configuration.

Fix that by making sure the value programmed by pcie_set_readrq() is
never larger than the configured MPS for that device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Acked-by: Jon Mason <mason@myri.com>
---
 drivers/pci/pci.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4e84fd4..d369316 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3203,8 +3203,6 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 	if (rq < 128 || rq > 4096 || !is_power_of_2(rq))
 		goto out;
 
-	v = (ffs(rq) - 8) << 12;
-
 	cap = pci_pcie_cap(dev);
 	if (!cap)
 		goto out;
@@ -3212,6 +3210,22 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 	err = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl);
 	if (err)
 		goto out;
+	/*
+	 * If using the "performance" PCIe config, we clamp the
+	 * read request size to the device's Max Payload Size to
+	 * prevent the host bridge from generating responses larger
+	 * than we can cope with.
+	 */
+	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
+		int mps = pcie_get_mps(dev);
+
+		if (mps < 0)
+			return mps;
+		if (mps < rq)
+			rq = mps;
+	}
+
+	v = (ffs(rq) - 8) << 12;
 
 	if ((ctl & PCI_EXP_DEVCTL_READRQ) != v) {
 		ctl &= ~PCI_EXP_DEVCTL_READRQ;
-- 
1.7.6.2
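
For illustration, the sort of blind driver call the changelog refers to
looks like this (hypothetical driver fragment; only pcie_set_readrq()
itself is real):

static int example_nic_probe(struct pci_dev *pdev)
{
	/*
	 * A common "performance tweak": ask for the largest legal
	 * MRRS.  With the patch above, under PCIE_BUS_PERFORMANCE
	 * the value is silently clamped to the device's MPS.
	 */
	return pcie_set_readrq(pdev, 4096);
}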



* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-03 14:50 [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings Jon Mason
@ 2011-10-03 20:56 ` Linus Torvalds
  2011-10-04 15:40   ` Benjamin Herrenschmidt
  2011-10-03 21:55 ` Jon Mason
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2011-10-03 20:56 UTC (permalink / raw)
  To: Jon Mason
  Cc: Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	Benjamin Herrenschmidt, linux-kernel, linux-pci

On Mon, Oct 3, 2011 at 7:50 AM, Jon Mason <mason@myri.com> wrote:
>
> Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Acked-by: Jon Mason <mason@myri.com>

You need to sign off on this one too, since you're passing it on.

Also, I'd *really* like some more acks/reviews on the series. Anybody?

                    Linus


* [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-03 14:50 [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings Jon Mason
  2011-10-03 20:56 ` Linus Torvalds
@ 2011-10-03 21:55 ` Jon Mason
  2011-10-04 14:42   ` Benjamin LaHaise
  1 sibling, 1 reply; 20+ messages in thread
From: Jon Mason @ 2011-10-03 21:55 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	Benjamin Herrenschmidt, linux-kernel, linux-pci

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

When configuring the PCIe settings for "performance", we allow parents
to have a larger Max Payload Size than children and rely on the
children's Max Read Request Size being no larger than their own MPS to
avoid having the host bridge generate responses they can't cope with.

However, various drivers in Linux call pcie_set_readrq() with arbitrary
values, assuming this to be a simple performance tweak. This breaks
under our "performance" configuration.

Fix that by making sure the value programmed by pcie_set_readrq() is
never larger than the configured MPS for that device.

Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Jon Mason <mason@myri.com>
---
 drivers/pci/pci.c |   18 ++++++++++++++++--
 1 files changed, 16 insertions(+), 2 deletions(-)

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index 4e84fd4..d369316 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -3203,8 +3203,6 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 	if (rq < 128 || rq > 4096 || !is_power_of_2(rq))
 		goto out;
 
-	v = (ffs(rq) - 8) << 12;
-
 	cap = pci_pcie_cap(dev);
 	if (!cap)
 		goto out;
@@ -3212,6 +3210,22 @@ int pcie_set_readrq(struct pci_dev *dev, int rq)
 	err = pci_read_config_word(dev, cap + PCI_EXP_DEVCTL, &ctl);
 	if (err)
 		goto out;
+	/*
+	 * If using the "performance" PCIe config, we clamp the
+	 * read request size to the device's Max Payload Size to
+	 * prevent the host bridge from generating responses larger
+	 * than we can cope with.
+	 */
+	if (pcie_bus_config == PCIE_BUS_PERFORMANCE) {
+		int mps = pcie_get_mps(dev);
+
+		if (mps < 0)
+			return mps;
+		if (mps < rq)
+			rq = mps;
+	}
+
+	v = (ffs(rq) - 8) << 12;
 
 	if ((ctl & PCI_EXP_DEVCTL_READRQ) != v) {
 		ctl &= ~PCI_EXP_DEVCTL_READRQ;
-- 
1.7.6.2



* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-03 21:55 ` Jon Mason
@ 2011-10-04 14:42   ` Benjamin LaHaise
  2011-10-04 15:37     ` Benjamin LaHaise
  2011-10-04 15:52     ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin LaHaise @ 2011-10-04 14:42 UTC (permalink / raw)
  To: Jon Mason
  Cc: Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	Benjamin Herrenschmidt, linux-kernel, linux-pci

On Mon, Oct 03, 2011 at 04:55:48PM -0500, Jon Mason wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> 
> When configuring the PCIe settings for "performance", we allow parents
> to have a larger Max Payload Size than children and rely on the
> children's Max Read Request Size being no larger than their own MPS to
> avoid having the host bridge generate responses they can't cope with.

I'm pretty sure that simply will not work, and is an incorrect understanding 
of how PCIe bridges and devices interact with regards to transaction size 
limits.  Here's why: I am actually implementing a PCIe nic on an FPGA at 
present, and have just been in the process of tuning how memory read 
requests are issued and processed.  It is perfectly valid for a PCIe 
endpoint to issue a read request for an entire 4KB block (assuming it 
respects the no 4KB boundary crossings rule), even when the MPS setting 
is only 64 or 128 bytes.  However, the root complex or PCIe bridge *must 
not* exceed the Maximum Payload Size for any completions with data or 
posted writes.  Multiple completions are okay and expected for read 
requests.  If the MPS on the bridge is set to a larger value than that 
of the endpoints connected to it, the bridge or root complex will 
happily send read completions exceeding the endpoint's MPS.  This can and 
will lead to failure on the part of the endpoints.
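
To put rough numbers on that splitting, a minimal user-space sketch
(it ignores Read Completion Boundary alignment and just counts the
MPS-sized completions):

#include <stdio.h>

int main(void)
{
	int request = 4096;	/* read request issued by the endpoint */
	int mps = 128;		/* Max Payload Size on the link */

	/* Each completion-with-data TLP carries at most MPS bytes,
	 * so one large request comes back as many completions. */
	int completions = (request + mps - 1) / mps;

	printf("%d-byte read at MPS %d -> %d completions\n",
	       request, mps, completions);	/* prints 32 */
	return 0;
}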

		-ben


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 14:42   ` Benjamin LaHaise
@ 2011-10-04 15:37     ` Benjamin LaHaise
  2011-10-04 15:52     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin LaHaise @ 2011-10-04 15:37 UTC (permalink / raw)
  To: Jon Mason
  Cc: Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	Benjamin Herrenschmidt, linux-kernel, linux-pci

On Tue, Oct 04, 2011 at 10:42:15AM -0400, Benjamin LaHaise wrote:
> On Mon, Oct 03, 2011 at 04:55:48PM -0500, Jon Mason wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > 
> > When configuring the PCIe settings for "performance", we allow parents
> > to have a larger Max Payload Size than children and rely on children
> > Max Read Request Size to not be larger than their own MPS to avoid
> > having the host bridge generate responses they can't cope with.
> 
> I'm pretty sure that simply will not work, and is an incorrect understanding 
> of how PCIe bridges and devices interact with regards to transaction size 
> limits.  Here's why: I am actually implementing a PCIe nic on an FPGA at 
> present, and have just been in the process of tuning how memory read 
> requests are issued and processed.  It is perfectly valid for a PCIe 
> endpoint to issue a read request for an entire 4KB block (assuming it 
> respects the no 4KB boundary crossings rule), even when the MPS setting 
> is only 64 or 128 bytes.  However, the root complex or PCIe bridge *must 
> not* exceed the Maximum Payload Size for any completions with data or 
> posted writes.  Multiple completions are okay and expected for read 
> requests.  If the MPS on the bridge is set to a larger value than that 
> of the endpoints connected to it, the bridge or root complex will 
> happily send read completions exceeding the endpoint's MPS.  This can and 
> will lead to failure on the part of the endpoints.

Just to clarify, my main concern is that restricting the size of read 
requests will impact performance negatively, for things like network tx.  
Issuing small reads for network tx made a huge impact on transmit 
performance, while also constraining rx performance in a full duplex 
scenario.  It also leaves the door open to incorrect behaviour in the 
case of posted writes (think of memcpy_toio()) since the write MPS is 
incorrect.

		-ben


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-03 20:56 ` Linus Torvalds
@ 2011-10-04 15:40   ` Benjamin Herrenschmidt
  2011-10-04 15:48     ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-04 15:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci

On Mon, 2011-10-03 at 13:56 -0700, Linus Torvalds wrote:
> On Mon, Oct 3, 2011 at 7:50 AM, Jon Mason <mason@myri.com> wrote:
> >
> > Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Acked-by: Jon Mason <mason@myri.com>
> 
> You need to sign off on this one too, since you're passing it on.
> 
> Also, I'd *really* like some more acks/reviews on the series. Anybody?

Hopefully most of these patches only affect the "performance" setting
which is no longer the default.

But yes, more reviews are always welcome.

Cheers,
Ben.

>                     Linus




* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 15:40   ` Benjamin Herrenschmidt
@ 2011-10-04 15:48     ` Linus Torvalds
  2011-10-04 15:56       ` Bjorn Helgaas
  2011-10-04 16:08       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 20+ messages in thread
From: Linus Torvalds @ 2011-10-04 15:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, Oct 4, 2011 at 8:40 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Hopefully most of these patches only affect the "performance" setting
> which is no longer the default.
>
> But yes, more reviews are always welcome.

Well, bcrl argues that patches 1-2 of 3 are actively wrong.

Quite frankly, my gut feel is that this late in the game, the only
patch I should apply is 3/3, and even that I wonder about. But that
seems to *really* disable all the games we do, all of which seem
questionable.

Comments? Please? I'm going to do an -rc9 today, and after that I'll
be travelling a lot, so..

                       Linus


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 14:42   ` Benjamin LaHaise
  2011-10-04 15:37     ` Benjamin LaHaise
@ 2011-10-04 15:52     ` Benjamin Herrenschmidt
  2011-10-04 15:59       ` Benjamin LaHaise
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-04 15:52 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Jon Mason, Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes,
	Bjorn Helgaas, linux-kernel, linux-pci

On Tue, 2011-10-04 at 10:42 -0400, Benjamin LaHaise wrote:
> On Mon, Oct 03, 2011 at 04:55:48PM -0500, Jon Mason wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > 
> > When configuring the PCIe settings for "performance", we allow parents
> > to have a larger Max Payload Size than children and rely on the
> > children's Max Read Request Size being no larger than their own MPS to
> > avoid having the host bridge generate responses they can't cope with.
> 
> I'm pretty sure that simply will not work, and is an incorrect understanding 
> of how PCIe bridges and devices interact with regards to transaction size 
> limits. 

Hi Ben !

I beg to disagree :) See below.

>  Here's why: I am actually implementing a PCIe nic on an FPGA at 
> present, and have just been in the process of tuning how memory read 
> requests are issued and processed.  It is perfectly valid for a PCIe 
> endpoint to issue a read request for an entire 4KB block (assuming it 
> respects the no 4KB boundary crossings rule), even when the MPS setting 
> is only 64 or 128 bytes.

But not if the Max Read Request Size of the endpoint is clamped, which
afaik is the whole point of the exercise.

>   However, the root complex or PCIe bridge *must 
> not* exceed the Maximum Payload Size for any completions with data or 
> posted writes.  Multiple completions are okay and expected for read 
> requests.  If the MPS on the bridge is set to a larger value than that 
> of the endpoints connected to it, the bridge or root complex will 
> happily send read completions exceeding the endpoint's MPS.  This can and 
> will lead to failure on the part of the endpoints.

Hence the clamping of MRRS, which is done by Jon's patch; the patch
referenced here by me additionally ensures that drivers who blindly try
to set it back to 4096 are also appropriately limited.

Note that in practice (though I haven't put that logic in Linux bare
metal yet), pHyp has an additional refinement which is to "know" what
the real max read response of the host bridge is and only clamp the MRRS
if the MPS of the device is lower than that. In practice, that means
that we don't clamp on most high speed adapters as our bridges never
reply with more than 512 bytes in a TLP, but this will require passing
some platform-specific information down which we don't have at hand
just yet.
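
As a sketch (this is not the pHyp code; "bridge_max_read_response"
stands in for exactly that platform information we can't pass down
yet):

static int clamp_mrrs_refined(struct pci_dev *dev, int rq,
			      int bridge_max_read_response)
{
	int mps = pcie_get_mps(dev);

	if (mps < 0)
		return mps;
	/* Only clamp when the device could actually see a completion
	 * larger than it can cope with. */
	if (mps < bridge_max_read_response && mps < rq)
		rq = mps;
	return rq;
}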

This is really the only way to avoid bogging everybody down to 128 bytes
if you have one hotplug leg on a switch or one slow device. For example
on some of our machines, if we don't apply that technique, the PCI-X ->
USB leg of the main switch will cause everything to go down to 128
bytes, including the on-board SAS controllers. (The chipset has 6 host
bridges or so but all the on-board stuff is behind a switch on one of
them).

Cheers,
Ben.




* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 15:48     ` Linus Torvalds
@ 2011-10-04 15:56       ` Bjorn Helgaas
  2011-10-04 16:08       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 20+ messages in thread
From: Bjorn Helgaas @ 2011-10-04 15:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Jon Mason, Greg Kroah-Hartman,
	Jesse Barnes, linux-kernel, linux-pci, Benjamin LaHaise

On Tue, Oct 4, 2011 at 9:48 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Oct 4, 2011 at 8:40 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
>>
>> Hopefully most of these patches only affect the "performance" setting
>> which is no longer the default.
>>
>> But yes, more reviews are always welcome.
>
> Well, bcrl argues that patches 1-2 of 3 are actively wrong.
>
> Quite frankly, my gut feel is that this late in the game, the only
> patch I should apply is 3/3, and even that I wonder about. But that
> seems to *really* disable all the games we do, all of which seem
> questionable.

The current state (with none of these three applied) is that we do
change MPS configuration in some cases and there is an unsolved
regression (Avi's e1000e problem) and bcrl's concern.

In my opinion, we need to turn off all MPS fiddling by default for rc9
with something like the following.  This is part of 3/3, which I think
should have been split into two pieces -- this one and the "also add"
part.

diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
index d369316..5db5986 100644
--- a/drivers/pci/pci.c
+++ b/drivers/pci/pci.c
@@ -77,7 +77,7 @@ unsigned long pci_cardbus_mem_size = DEFAULT_CARDBUS_MEM_SIZE;
 unsigned long pci_hotplug_io_size  = DEFAULT_HOTPLUG_IO_SIZE;
 unsigned long pci_hotplug_mem_size = DEFAULT_HOTPLUG_MEM_SIZE;

-enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_SAFE;
+enum pcie_bus_config_types pcie_bus_config = PCIE_BUS_TUNE_OFF;

 /*
 * The default CLS is used if arch didn't set CLS explicitly and not
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index a919db2..4829424 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -1451,12 +1451,24 @@ static int pcie_bus_configure_set(struct pci_dev *dev, void *data)
 */
 void pcie_bus_configure_settings(struct pci_bus *bus, u8 mpss)
 {
       if (!pci_is_pcie(bus->self))
               return;

+       if (pcie_bus_config == PCIE_BUS_TUNE_OFF)
+               return;


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 15:52     ` Benjamin Herrenschmidt
@ 2011-10-04 15:59       ` Benjamin LaHaise
  2011-10-04 16:19         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin LaHaise @ 2011-10-04 15:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes,
	Bjorn Helgaas, linux-kernel, linux-pci

On Tue, Oct 04, 2011 at 05:52:02PM +0200, Benjamin Herrenschmidt wrote:
> Hi Ben !

Hey Ben as well!

> I beg to disagree :) See below.
> 
> >  Here's why: I am actually implementing a PCIe nic on an FPGA at 
> > present, and have just been in the process of tuning how memory read 
> > requests are issued and processed.  It is perfectly valid for a PCIe 
> > endpoint to issue a read request for an entire 4KB block (assuming it 
> > respects the no 4KB boundary crossings rule), even when the MPS setting 
> > is only 64 or 128 bytes.
> 
> But not if the Max Read Request Size of the endpoint is clamped, which
> afaik is the whole point of the exercise.

Yes, that is true.  However, from a performance point of view, clamping 
the maximum read size has a large, negative effect.  Going from 32 dword 
reads to full packet-sized reads in my project meant the difference 
between 50MB/s and 110MB/s on gige with 1500 byte packets.

> Hence the clamping of MRRS, which is done by Jon's patch; the patch
> referenced here by me additionally ensures that drivers who blindly try
> to set it back to 4096 are also appropriately limited.

I have checked, and the large read requests are supported by all the systems 
I have access to (mix of 2-5 year old hardware).

> Note that in practice (though I haven't put that logic in Linux bare
> metal yet), pHyp has an additional refinement which is to "know" what
> the real max read response of the host bridge is and only clamp the MRRS
> if the MPS of the device is lower than that. In practice, that means
> that we don't clamp on most high speed adapters as our bridges never
> reply with more than 512 bytes in a TLP, but this will require passing
> > some platform-specific information down which we don't have at hand
> just yet.

My concern, which I forgot to put in the original message, is that 
allowing a bridge to have a larger MPS than the endpoints will result in 
things failing when a large write occurs.  Aiui, we don't restrict the 
size of memcpy_toio() type functions, and there are PCIe devices which 
do not perform DMA.  Clamping MPS on the bridge is a requirement for 
correctness.
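
As an illustration of that failure mode (hypothetical driver fragment;
nothing here limits the store stream to the endpoint's MPS):

static void push_descriptor(void __iomem *bar, const void *desc)
{
	/* If the root complex write-combines these stores into a
	 * single 256-byte posted write while the endpoint's MPS is
	 * only 128, the endpoint sees a malformed TLP. */
	memcpy_toio(bar, desc, 256);
}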

> This is really the only way to avoid bogging everybody down to 128 bytes
> if you have one hotplug leg on a switch or one slow device. For example
> on some of our machines, if we don't apply that technique, the PCI-X ->
> USB leg of the main switch will cause everything to go down to 128
> bytes, including the on-board SAS controllers. (The chipset has 6 host
> bridges or so but all the on-board stuff is behind a switch on one of
> them).

The difference in overhead between 128 and 256 byte TLPs isn't that great.  
64 bytes is pretty bad, I'd agree.  That said, I'd be interested in seeing 
how things measure up when you have the PCIe link busy in both directions.

		-ben


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 15:48     ` Linus Torvalds
  2011-10-04 15:56       ` Bjorn Helgaas
@ 2011-10-04 16:08       ` Benjamin Herrenschmidt
  2011-10-04 16:51         ` Linus Torvalds
  1 sibling, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-04 16:08 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, 2011-10-04 at 08:48 -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2011 at 8:40 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Hopefully most of these patches only affect the "performance" setting
> > which is no longer the default.
> >
> > But yes, more reviews are always welcome.
> 
> Well, bcrl argues that patches 1-2 of 3 are actively wrong.

This is an argument that isn't finished :-)

In any case, what Ben's arguing about is the whole "performance" mode
which is -already- implemented in your tree, except that the current
implementation you have is totally broken and those patches fix it. This
mode is optional and isn't the default.
 
> Quite frankly, my gut feel is that this late in the game, the only
> patch I should apply is 3/3, and even that I wonder about. But that
> seems to *really* disable all the games we do, all of which seem
> questionable.

Sort-of yes. The "safe" mode shouldn't be questionable. It's the right
way to safely configure the system. But as usual, we get defeated by
chipset errata.

Still, I think Jon has workarounds for a bunch and having those modes as
options for broken BIOSes is handy.

The real thing for me is that this ability to configure MPS and MRRS
from scratch is needed for platforms where the firmware isn't doing it
at all, such as some embedded platforms or our new upcoming "no pHyp"
KVM-oriented power platforms.

So I like having the code there and the ability to turn it on either
manually or from platform code.

But I agree that at this point, the right default for the x86 mess
should be "don't touch" which is what 3/3 does afaik.

> Comments? Please? I'm going to do an -rc9 today, and after that I'll
> be travelling a lot, so..

My vote is to apply Jon's patches. 1/3 and 2/3 will have no effect unless
the "performance" mode is explicitly requested. Without them, it's
totally busted (ie the current implementation you have in your tree is
broken). Patch 3/3 will make sure that the default is to not do
anything.

Cheers,
Ben.






* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 15:59       ` Benjamin LaHaise
@ 2011-10-04 16:19         ` Benjamin Herrenschmidt
  2011-10-04 16:44           ` Benjamin LaHaise
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-04 16:19 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: Jon Mason, Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes,
	Bjorn Helgaas, linux-kernel, linux-pci

On Tue, 2011-10-04 at 11:59 -0400, Benjamin LaHaise wrote:

> Yes, that is true.  However, from a performance point of view, clamping 
> the maximum read size has a large, negative effect.  Going from 32 dword 
> reads to full packet-sized reads in my project meant the difference 
> between 50MB/s and 110MB/s on gige with 1500 byte packets.

It depends how badly it's clamped and it's a tradeoff vs. clamping (or
not) the MPS. Ie. in our case it avoids clamping the MPS down to 128.

> > Hence the clamping of MRRS, which is done by Jon's patch; the patch
> > referenced here by me additionally ensures that drivers who blindly try
> > to set it back to 4096 are also appropriately limited.
> 
> I have checked, and the large read requests are supported by all the systems 
> I have access to (mix of 2-5 year old hardware).

Yes, again, we only do that clamping on MRRS when the "performance" mode
is enabled explicitly. In any other case, we leave that alone and we
allow drivers to set it to 4K if they wish to do so.

> > Note that in practice (though I haven't put that logic in Linux bare
> > metal yet), pHyp has an additional refinement which is to "know" what
> > the real max read response of the host bridge is and only clamp the MRRS
> > if the MPS of the device is lower than that. In practice, that means
> > that we don't clamp on most high speed adapters as our bridges never
> > reply with more than 512 bytes in a TLP, but this will require passing
> > some platform-specific information down which we don't have at hand
> > just yet.
> 
> My concern, which I forgot to put in the original message, is that 
> allowing a bridge to have a larger MPS than the endpoints will result 
> in things failing when a large write occurs.  Aiui, we don't restrict 
> the size of memcpy_toio() type functions, and there are PCIe devices 
> which do not perform DMA.  Clamping MPS on the bridge is a requirement 
> for correctness.

Yes, this is a problem if your bridge can generate large writes. Ours
can't, so this is safe. Our bridges can't write-combine beyond 128 bytes.

Originally, I had that whole logic implemented in arch-specific code,
but when Jon started fiddling with MPS/MRRS in generic code, I suggested
having the generic code provide the option of doing what I want. Maybe
we should remove it completely as a user-visible option and leave it to
platform code only. In any case, it isn't the default.

> > This is really the only way to avoid bogging everybody down to 128 bytes
> > if you have one hotplug leg on a switch or one slow device. For example
> > on some of our machines, if we don't apply that technique, the PCI-X ->
> > USB leg of the main switch will cause everything to go down to 128
> > bytes, including the on-board SAS controllers. (The chipset has 6 host
> > bridges or so but all the on-board stuff is behind a switch on one of
> > them).
> 
> The difference in overhead between 128 and 256 byte TLPs isn't that great.

Our performance guys say that between 128 and 512 or 2048, which is what
we face in some cases here, it's worth it. I haven't yet had a chance to
verify by myself.

> 64 bytes is pretty bad, I'd agree.  That said, I'd be interested in seeing 
> how things measure up when you have the PCIe link busy in both directions.

When I get the HW support for some of those new platforms in a more
final shape and a bit more time (ie after I'm back from travelling :-) I
should be able to do better measurements with some 10GE adapters for
example, or with our RAID controllers.

Cheers,
Ben.




* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 16:19         ` Benjamin Herrenschmidt
@ 2011-10-04 16:44           ` Benjamin LaHaise
  0 siblings, 0 replies; 20+ messages in thread
From: Benjamin LaHaise @ 2011-10-04 16:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Linus Torvalds, Greg Kroah-Hartman, Jesse Barnes,
	Bjorn Helgaas, linux-kernel, linux-pci

On Tue, Oct 04, 2011 at 06:19:57PM +0200, Benjamin Herrenschmidt wrote:
> On Tue, 2011-10-04 at 11:59 -0400, Benjamin LaHaise wrote:
...
> > My concern, which I forgot to put in the original message, is that 
> > allowing a bridge to have a larger MPS than the endpoints will result 
> > in things failing when a large write occurs.  Aiui, we don't restrict 
> > the size of memcpy_toio() type functions, and there are PCIe devices 
> > which do not perform DMA.  Clamping MPS on the bridge is a requirement 
> > for correctness.
> 
> Yes, this is a problem if your bridge can generate large writes. Ours
> can't so this is safe. Our bridges can't write combine beyond 128 bytes.

> Originally, I had that whole logic implemented in arch-specific code,
> but when Jon started fiddling with MPS/MRRS in generic code, I suggested
> having the generic code provide the option of doing what I want. Maybe
> we should remove it completely as a user-visible option and leave it to
> platform code only. In any case, it isn't the default.

Ah, I see now.  Yes, if that can be put into arch specific code, it makes 
far more sense since things will work correctly assuming the bridge also 
limits read completions with data to 128 bytes?

> Our performance guys say that between 128 and 512 or 2048, which is what
> we face in some cases here, it's worth it.  I haven't yet had a chance to
> verify by myself.

Just to give you an idea of the problems with generating small read 
requests: the on-the-wire cost of sending a 1500 byte ethernet packet 
with 128 byte read requests is the difference between 1x 12-16 byte TLP 
on the transmit link of the endpoint vs 12x 12-16 byte TLPs (the overhead 
for a TLP is at least 8 bytes of header and CRC iirc, maybe more for the 
DLLP flag bytes) or going from 20-24 bytes per tx packet to 256-288 bytes 
of read requests.  This won't matter much for devices with lots of excess 
bandwidth, but for PCIe 1x devices (ie common gige chipsets) that really 
hurts.  It also means that only 2 or 3 packets can have read requests 
outstanding with the default limit of 32 tags, but hopefully the device 
supports extended tags.
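
The same arithmetic as a quick user-space sketch (the 12-16 byte
request TLPs and ~8 bytes of framing are the estimates above):

#include <stdio.h>

int main(void)
{
	int pkt = 1500;		/* ethernet frame fetched for tx */
	int mrrs = 128;		/* clamped read request size */
	int wire_lo = 12 + 8;	/* request TLP + framing, low estimate */
	int wire_hi = 16 + 8;	/* high estimate */

	int reqs = (pkt + mrrs - 1) / mrrs;	/* 12 read requests */

	printf("one packet-sized request: %d-%d wire bytes\n",
	       wire_lo, wire_hi);
	printf("%d x %d-byte requests: %d-%d wire bytes\n",
	       reqs, mrrs, reqs * wire_lo, reqs * wire_hi);
	return 0;
}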

		-ben


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 16:08       ` Benjamin Herrenschmidt
@ 2011-10-04 16:51         ` Linus Torvalds
  2011-10-04 17:30           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2011-10-04 16:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, Oct 4, 2011 at 9:08 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>>
>> Well, bcrl argues that patches 1-2 of 3 are actively wrong.
>
> This is an argument that isn't finished :-)

Oh, I absolutely *hope* it isn't finished, but there's no way I'm
applying a patch for -rc9 that people are still actively arguing
whether it's at all valid or not.

Which is why I'm planning on applying 3/3 just to make the whole issue
irrelevant for 3.1, and then the people who want to test things out
can apply whatever patches they want and play with the kernel command
line options to actually enable whatever behavior they are testing.

                       Linus


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 16:51         ` Linus Torvalds
@ 2011-10-04 17:30           ` Benjamin Herrenschmidt
  2011-10-04 17:36             ` Linus Torvalds
  2011-10-04 17:41             ` Benjamin LaHaise
  0 siblings, 2 replies; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-04 17:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, 2011-10-04 at 09:51 -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2011 at 9:08 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >>
> >> Well, bcrl argues that patches 1-2 of 3 are actively wrong.
> >
> > This is an argument that isn't finished :-)
> 
> Oh, I absolutely *hope* it isn't finished, but there's no way I'm
> applying a patch for -rc9 that people are still actively arguing
> whether it's at all valid or not.

Well, thing is, you -already- have the whole "performance" option which
is what we are 'arguing' about upstream. Except that the implementation
of it that you have in your tree now has very nasty bugs (ie it doesn't
do what it's supposed to and really doesn't work).

Patches 1 and 2 fix it to do what it's supposed to.

Whether that's a useful scheme or not is what Ben and I are discussing
and frankly the jury is still out as to whether it's beneficial or not
for the generic case. (It's basically an algorithm used today by pHyp on
power).

I wouldn't mind much if we just ripped it out, and left the other
options only at this stage.

> Which is why I'm planning on applying 3/3 just to make the whole issue
> irrelevant for 3.1, and then the people who want to test things out
> can apply whatever patches they want and play with the kernel command
> line options to actually enable whatever behavior they are testing.

As you like but I still think you should apply 1 and 2. As I said above,
the option is there already but the implementation is buggy. Let's at
least make it do what it's supposed to. It's still optional and will
still need testing and benchmarking but heh ...

Or the other option is to rip out all the "performance" case code.

I don't think it makes any sense to keep that option in its current
non-working form. IE. The code right now does utterly wrong things and
generates something that cannot work if you choose that mode on your
kernel command line.

Cheers,
Ben.



* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 17:30           ` Benjamin Herrenschmidt
@ 2011-10-04 17:36             ` Linus Torvalds
  2011-10-05  7:01               ` Benjamin Herrenschmidt
  2011-10-04 17:41             ` Benjamin LaHaise
  1 sibling, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2011-10-04 17:36 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, Oct 4, 2011 at 10:30 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Patches 1 and 2 fix it to do what it's supposed to.

Maybe.

Or maybe not.

We just don't the hell know, do we?

And -rc9 is not when we are supposed to make these kinds of changes.

                              Linus


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 17:30           ` Benjamin Herrenschmidt
  2011-10-04 17:36             ` Linus Torvalds
@ 2011-10-04 17:41             ` Benjamin LaHaise
  1 sibling, 0 replies; 20+ messages in thread
From: Benjamin LaHaise @ 2011-10-04 17:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Jon Mason, Greg Kroah-Hartman, Jesse Barnes,
	Bjorn Helgaas, linux-kernel, linux-pci

On Tue, Oct 04, 2011 at 07:30:39PM +0200, Benjamin Herrenschmidt wrote:
> Well, thing is, you -already- have the whole "performance" option which
> is what we are 'arguing' about upstream. Except that the implementation
> of it that you have in your tree now has very nasty bugs (ie it doesn't
> do what it's supposed to and really doesn't work).
> 
> Patches 1 and 2 fix it to do what it's supposed to.

I'll agree with benh for now since, yes, the performance option is 
completely broken in just blindly applying settings right now.  It should 
be revisited and made so that a safe performance setting is supported in 
the future (possibly via suggestions from arch-specific quirks), since 
having to configure this makes little sense for the vast majority of 
users.  Applying 3/3 first would also make the most sense.

		-ben



* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-04 17:36             ` Linus Torvalds
@ 2011-10-05  7:01               ` Benjamin Herrenschmidt
  2011-10-05 14:49                 ` Linus Torvalds
  0 siblings, 1 reply; 20+ messages in thread
From: Benjamin Herrenschmidt @ 2011-10-05  7:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Tue, 2011-10-04 at 10:36 -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2011 at 10:30 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Patches 1 and 2 fix it to do what it's supposed to.
> 
> Maybe.
> 
> Or maybe not.

:-)

> We just don't the hell know, do we?

Well, we do with some confidence :-) Or rather, what we do know is that
what you have today in your tree is broken.

Let's break the story back into simple steps; I think there's a lot of
confusion about what's there, what's not, etc.:

 - I came up with a crackpot configuration scheme for MPS based on what
pHyp does on our machines. Whether that scheme is suitable for the
general case, or even useful, is the root of the debate between Ben
and me (I originally intended to have this inside arch/powerpc).

 - Jon was trying to fix MPS related issues (BIOS doing the wrong thing
on some machines) at the same time in generic code.

 - I described my scheme to Jon, who implemented it as the "performance"
option; he also implemented a "safe" option which does the classic
algorithm of clamping everybody down to the lowest common MPS (see the
sketch after this list).

 - Jon's original patch (which went upstream) had a few problems. Among
other things it made "performance" the default, and he didn't completely
understand what I wanted, ie, his implementation of "performance" was in
fact totally broken (it would try to configure devices with MPS larger
than what they can support, for example). So that was a trainwreck.

 - I was travelling and didn't get to review things properly before it
went upstream.

 - Jon eventually got to fix things in most cases by switching the
default to "safe" rather than "performance" (as it should have been to
start with). This exposed different problems related to even "safe" not
working well on some chipsets due to HW errata. This is what you have
upstream today.

 - Jon's patch 3 switches things to "don't touch" by default. We agree
that's what we want for 3.1. However, whatever way you look at it,
"performance" is still totally broken (he didn't implement my algorithm,
but something "else" that cannot work and never worked).

 - Patches 1 and 2 only affect the "performance" case and "fix" it to do
what I want.
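
For reference, the "safe" algorithm mentioned in the list above amounts
to something like this (helper names are mine, not the actual probe.c
code; pci_walk_bus(), pcie_set_mps() and dev->pcie_mpss are the real
interfaces in this tree):

static int collect_min_mpss(struct pci_dev *dev, void *data)
{
	u8 *mpss = data;

	/* pcie_mpss is the encoded supported MPS: 0 = 128B, 1 = 256B... */
	if (dev->pcie_mpss < *mpss)
		*mpss = dev->pcie_mpss;
	return 0;
}

static int apply_common_mps(struct pci_dev *dev, void *data)
{
	pcie_set_mps(dev, 128 << *(u8 *)data);
	return 0;
}

static void pcie_bus_configure_safe(struct pci_bus *bus, u8 mpss)
{
	pci_walk_bus(bus, collect_min_mpss, &mpss);
	pci_walk_bus(bus, apply_common_mps, &mpss);
}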

So patches 1 and 2 are about taking the existing busted & known to not
work "performance" option in your tree and turning it into what I
originally wanted it to be, which will work at least for some cases
(tested !).

Now you may prefer to just remove the code for "performance" completely;
I wouldn't object.

But don't leave it there in a known totally busted state.... at least
that's my preference.

Cheers,
Ben.




* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-05  7:01               ` Benjamin Herrenschmidt
@ 2011-10-05 14:49                 ` Linus Torvalds
  2011-10-05 16:26                   ` Jesse Barnes
  0 siblings, 1 reply; 20+ messages in thread
From: Linus Torvalds @ 2011-10-05 14:49 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jon Mason, Greg Kroah-Hartman, Jesse Barnes, Bjorn Helgaas,
	linux-kernel, linux-pci, Benjamin LaHaise

On Wed, Oct 5, 2011 at 12:01 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
>> We just don't the hell know, do we?
>
> Well, we do with some confidence :-) Or rather, what we do know is that
> what you have today in your tree is broken.

You're missing the point.

Repeat after me: late -rc series is not when we test these kinds of things.

This point in time is when we *revert* commits that are broken. We fix
them if there is absolutely no question about the fix, but that simply
isn't true here. Even if the patches "improve" something, there is no
way I hell that I believe that we suddenly don't need to worry about
MPS any more.

How hard is this to just understand? It's not about "we can improve
things". It's about "it's f*%!ing late in the rc series, we're not
dicking around any more!".

So quite frankly, if you don't like the code now, send me a revert for
all the mess. BUT DON'T ARGUE FOR CHANGES THAT AREN'T 100% ROCK SOLID.

So the current situation is that MPS is simply *disabled*. All the
crap code simply doesn't matter at all, because nobody will run it.
Arguing that it is "broken" is stupid, because the only people that
that brokenness would ever matter for are the people like you who are
testing things out - not actual users.

Comprende?

Think of it like a stable kernel. We don't mess around with things
that don't matter and nobody will hit. We don't do "development" in
the late -rc, and yet that is what the MPS patches in question have
been doing.

IT IS TOO LATE FOR CRAP LIKE THAT. WE TURNED IT OFF.

                       Linus


* Re: [PATCH 2/3] pci: Clamp pcie_set_readrq() when using "performance" settings
  2011-10-05 14:49                 ` Linus Torvalds
@ 2011-10-05 16:26                   ` Jesse Barnes
  0 siblings, 0 replies; 20+ messages in thread
From: Jesse Barnes @ 2011-10-05 16:26 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Jon Mason, Greg Kroah-Hartman,
	Bjorn Helgaas, linux-kernel, linux-pci, Benjamin LaHaise

On Wed, 5 Oct 2011 07:49:24 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Wed, Oct 5, 2011 at 12:01 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> >> We just don't the hell know, do we?
> >
> > Well, we do with some confidence :-) Or rather, what we do know is that
> > what you have today in your tree is broken.
> 
> You're missing the point.
> 
> Repeat after me: late -rc series is not when we test these kinds of things.

Yeah I think Ben just wants the convenience of having the non-default
code paths fixed when 3.1 comes out; but I agree we shouldn't be poking
around at all at this point.

Hopefully I'll get my kernel.org tree back up and running this week so
I can queue up the various fixes for the merge window.  I'll do another
scan and ask people to send directly to you for the 2 or 3 things we
still have open for 3.1.

Thanks,
-- 
Jesse Barnes, Intel Open Source Technology Center

