linux-pci.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] arm64: Enable PCI write-combine resources under sysfs
@ 2020-08-31 15:18 Clint Sbisa
  2020-08-31 23:24 ` Benjamin Herrenschmidt
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Clint Sbisa @ 2020-08-31 15:18 UTC (permalink / raw)
  To: linux-pci; +Cc: Lorenzo Pieralisi, Bjorn Helgaas, csbisa, benh

Using write-combine is crucial for performance of PCI devices where
significant amounts of transactions go over PCI BARs.

arm64 supports write-combine PCI mappings, so the appropriate define
has been added which will expose write-combine mappings under sysfs
for prefetchable PCI resources.

Signed-off-by: Clint Sbisa <csbisa@amazon.com>
---
 arch/arm64/include/asm/pci.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
index 70b323cf8300..b33ca260e3c9 100644
--- a/arch/arm64/include/asm/pci.h
+++ b/arch/arm64/include/asm/pci.h
@@ -17,6 +17,7 @@
 #define pcibios_assign_all_busses() \
 	(pci_has_flag(PCI_REASSIGN_ALL_BUS))
 
+#define arch_can_pci_mmap_wc() 1
 #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
 
 extern int isa_dma_bridge_buggy;
-- 
2.23.3


^ permalink raw reply related	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-08-31 15:18 [PATCH] arm64: Enable PCI write-combine resources under sysfs Clint Sbisa
@ 2020-08-31 23:24 ` Benjamin Herrenschmidt
  2020-09-01 18:37 ` Bjorn Helgaas
  2020-09-02 11:32 ` Lorenzo Pieralisi
  2 siblings, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-08-31 23:24 UTC (permalink / raw)
  To: Clint Sbisa, linux-pci; +Cc: Lorenzo Pieralisi, Bjorn Helgaas

On Mon, 2020-08-31 at 15:18 +0000, Clint Sbisa wrote:
> Using write-combine is crucial for performance of PCI devices where
> significant amounts of transactions go over PCI BARs.
> 
> arm64 supports write-combine PCI mappings, so the appropriate define
> has been added which will expose write-combine mappings under sysfs
> for prefetchable PCI resources.
> 
> Signed-off-by: Clint Sbisa <csbisa@amazon.com>

Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>

> ---
>  arch/arm64/include/asm/pci.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/pci.h
> b/arch/arm64/include/asm/pci.h
> index 70b323cf8300..b33ca260e3c9 100644
> --- a/arch/arm64/include/asm/pci.h
> +++ b/arch/arm64/include/asm/pci.h
> @@ -17,6 +17,7 @@
>  #define pcibios_assign_all_busses() \
>  	(pci_has_flag(PCI_REASSIGN_ALL_BUS))
>  
> +#define arch_can_pci_mmap_wc() 1
>  #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
>  
>  extern int isa_dma_bridge_buggy;


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-08-31 15:18 [PATCH] arm64: Enable PCI write-combine resources under sysfs Clint Sbisa
  2020-08-31 23:24 ` Benjamin Herrenschmidt
@ 2020-09-01 18:37 ` Bjorn Helgaas
  2020-09-01 23:22   ` Benjamin Herrenschmidt
  2020-09-02 11:32 ` Lorenzo Pieralisi
  2 siblings, 1 reply; 59+ messages in thread
From: Bjorn Helgaas @ 2020-09-01 18:37 UTC (permalink / raw)
  To: Clint Sbisa; +Cc: linux-pci, Lorenzo Pieralisi, benh

On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> Using write-combine is crucial for performance of PCI devices where
> significant amounts of transactions go over PCI BARs.
> 
> arm64 supports write-combine PCI mappings, so the appropriate define
> has been added which will expose write-combine mappings under sysfs
> for prefetchable PCI resources.
> 
> Signed-off-by: Clint Sbisa <csbisa@amazon.com>

Fine with me, I assume Will or Catalin will apply this.

> ---
>  arch/arm64/include/asm/pci.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> index 70b323cf8300..b33ca260e3c9 100644
> --- a/arch/arm64/include/asm/pci.h
> +++ b/arch/arm64/include/asm/pci.h
> @@ -17,6 +17,7 @@
>  #define pcibios_assign_all_busses() \
>  	(pci_has_flag(PCI_REASSIGN_ALL_BUS))
>  
> +#define arch_can_pci_mmap_wc() 1
>  #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
>  
>  extern int isa_dma_bridge_buggy;
> -- 
> 2.23.3
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-01 18:37 ` Bjorn Helgaas
@ 2020-09-01 23:22   ` Benjamin Herrenschmidt
  2020-09-02  8:57     ` Will Deacon
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-01 23:22 UTC (permalink / raw)
  To: Bjorn Helgaas, Clint Sbisa
  Cc: linux-pci, Lorenzo Pieralisi, Catalin Marinas, Will Deacon, linux-kernel

On Tue, 2020-09-01 at 13:37 -0500, Bjorn Helgaas wrote:
> On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > Using write-combine is crucial for performance of PCI devices where
> > significant amounts of transactions go over PCI BARs.
> > 
> > arm64 supports write-combine PCI mappings, so the appropriate
> > define
> > has been added which will expose write-combine mappings under sysfs
> > for prefetchable PCI resources.
> > 
> > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> 
> Fine with me, I assume Will or Catalin will apply this.

Haha ! Client had sent it to them originally and I told him to resend
it to linux-pci, yourself and Lorenzo :-)

So the confusion is on me.

Will, Catalin, it's all yours. You should have the original patch in
your mbox already, otherwise:

https://patchwork.kernel.org/patch/11729875/

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-01 23:22   ` Benjamin Herrenschmidt
@ 2020-09-02  8:57     ` Will Deacon
  0 siblings, 0 replies; 59+ messages in thread
From: Will Deacon @ 2020-09-02  8:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Bjorn Helgaas, Clint Sbisa, linux-pci, Lorenzo Pieralisi,
	Catalin Marinas, linux-kernel

On Wed, Sep 02, 2020 at 09:22:53AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-01 at 13:37 -0500, Bjorn Helgaas wrote:
> > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > > Using write-combine is crucial for performance of PCI devices where
> > > significant amounts of transactions go over PCI BARs.
> > > 
> > > arm64 supports write-combine PCI mappings, so the appropriate
> > > define
> > > has been added which will expose write-combine mappings under sysfs
> > > for prefetchable PCI resources.
> > > 
> > > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> > 
> > Fine with me, I assume Will or Catalin will apply this.
> 
> Haha ! Client had sent it to them originally and I told him to resend
> it to linux-pci, yourself and Lorenzo :-)
> 
> So the confusion is on me.
> 
> Will, Catalin, it's all yours. You should have the original patch in
> your mbox already, otherwise:
> 
> https://patchwork.kernel.org/patch/11729875/

Yup, it's not the radar. We don't usually start queuing stuff until -rc3, so
I'm working through the backlog this week. Would like an Ack from Lorenzo,
though.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-08-31 15:18 [PATCH] arm64: Enable PCI write-combine resources under sysfs Clint Sbisa
  2020-08-31 23:24 ` Benjamin Herrenschmidt
  2020-09-01 18:37 ` Bjorn Helgaas
@ 2020-09-02 11:32 ` Lorenzo Pieralisi
  2020-09-02 14:29   ` Clint Sbisa
  2 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-02 11:32 UTC (permalink / raw)
  To: Clint Sbisa; +Cc: linux-pci, Bjorn Helgaas, benh

On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> Using write-combine is crucial for performance of PCI devices where
> significant amounts of transactions go over PCI BARs.

Write-combine is an x86ism that means nothing on ARM64 platforms
so this should be rewritten to say what you actually mean, namely,
you want to allow prefetchable resources to be mapped with
"write combine" semantics (which means normal non-cacheable
memory on arm64) through proc/sysfs.

This is an outright can of worms and the PCI specs don't help in this
respect, since we may end up mapping resources that have read
side-effects with normal NC mappings (ie that's what "write combine" is
in arm64 - pgprot_writecombine() and that's speculative memory).

I am referring to "Additional Guidance on the Prefetchable Bit
in Memory Space BARs" in the PCI specifications - it does not make
any sense and must be removed because people use it to design
endpoints.

True - this is a problem even in kernel drivers but at least there
the ioremap_ semantics is in the driver and can be vetted.

This patch would make it user space ABI so I am a little nervous
about merging this code TBH.

> arm64 supports write-combine PCI mappings, so the appropriate define
> has been added which will expose write-combine mappings under sysfs
> for prefetchable PCI resources.
> 
> Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> ---
>  arch/arm64/include/asm/pci.h | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> index 70b323cf8300..b33ca260e3c9 100644
> --- a/arch/arm64/include/asm/pci.h
> +++ b/arch/arm64/include/asm/pci.h
> @@ -17,6 +17,7 @@
>  #define pcibios_assign_all_busses() \
>  	(pci_has_flag(PCI_REASSIGN_ALL_BUS))
>  
> +#define arch_can_pci_mmap_wc() 1

I am not comfortable with this blanket enable. Some existing drivers,
eg:

drivers/infiniband/hw/mlx5

use this macro to detect WC capability which again, it is x86 specific,
on arm64 it means nothing and can have consequences on the driver
operations.

Thanks,
Lorenzo

>  #define ARCH_GENERIC_PCI_MMAP_RESOURCE	1
>  
>  extern int isa_dma_bridge_buggy;
> -- 
> 2.23.3
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 11:32 ` Lorenzo Pieralisi
@ 2020-09-02 14:29   ` Clint Sbisa
  2020-09-02 16:47     ` Lorenzo Pieralisi
  0 siblings, 1 reply; 59+ messages in thread
From: Clint Sbisa @ 2020-09-02 14:29 UTC (permalink / raw)
  To: Lorenzo Pieralisi; +Cc: linux-pci, Bjorn Helgaas, benh

On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote:
> 
> On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > Using write-combine is crucial for performance of PCI devices where
> > significant amounts of transactions go over PCI BARs.
> 
> Write-combine is an x86ism that means nothing on ARM64 platforms
> so this should be rewritten to say what you actually mean, namely,
> you want to allow prefetchable resources to be mapped with
> "write combine" semantics (which means normal non-cacheable
> memory on arm64) through proc/sysfs.
> 
> This is an outright can of worms and the PCI specs don't help in this
> respect, since we may end up mapping resources that have read
> side-effects with normal NC mappings (ie that's what "write combine" is
> in arm64 - pgprot_writecombine() and that's speculative memory).
> 
> I am referring to "Additional Guidance on the Prefetchable Bit
> in Memory Space BARs" in the PCI specifications - it does not make
> any sense and must be removed because people use it to design
> endpoints.
> 
> True - this is a problem even in kernel drivers but at least there
> the ioremap_ semantics is in the driver and can be vetted.
> 
> This patch would make it user space ABI so I am a little nervous
> about merging this code TBH.
> 

User space applications are utilizing WC already. You can see DPDK code using
resourceX_wc over the usual resourceX file at
https://github.com/DPDK/dpdk/blob/main/drivers/bus/pci/linux/pci_uio.c#L312
(commit https://github.com/DPDK/dpdk/commit/4a928ef9f61).

Given that write-combine support was added in 2008 for x86 (and is also enabled
for powerpc and ia64), I'm not sure if there'd be a downside to enabling it on
arm64 as well given how prevalent it is. Lorenzo, do you still have particular
concerns about exposing this to userspace?

I understand that my commit message is outright wrong after your explanation,
so I'll rewrite that.

> > arm64 supports write-combine PCI mappings, so the appropriate define
> > has been added which will expose write-combine mappings under sysfs
> > for prefetchable PCI resources.
> >
> > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> > ---
> >  arch/arm64/include/asm/pci.h | 1 +
> >  1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> > index 70b323cf8300..b33ca260e3c9 100644
> > --- a/arch/arm64/include/asm/pci.h
> > +++ b/arch/arm64/include/asm/pci.h
> > @@ -17,6 +17,7 @@
> >  #define pcibios_assign_all_busses() \
> >       (pci_has_flag(PCI_REASSIGN_ALL_BUS))
> >
> > +#define arch_can_pci_mmap_wc() 1
> 
> I am not comfortable with this blanket enable. Some existing drivers,
> eg:
> 
> drivers/infiniband/hw/mlx5
> 
> use this macro to detect WC capability which again, it is x86 specific,
> on arm64 it means nothing and can have consequences on the driver
> operations.

If that driver is fixed to check what it actually wants to check, would that
address your concern about the blanket enable? I don't see any other references
to this in kernel drivers and I think the documentation at
`filesystems/sysfs-pci.rst` outlines it pretty explicitly:

  Platforms which support write-combining maps of PCI resources must define
  arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
  write-combining is permitted.

It is otherwise only used by pci-sysfs to determine if a resourceX_wc file
should be exposed.

Thanks,
Clint

> 
> Thanks,
> Lorenzo
> 
> >  #define ARCH_GENERIC_PCI_MMAP_RESOURCE       1
> >
> >  extern int isa_dma_bridge_buggy;
> > --
> > 2.23.3
> >

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 14:29   ` Clint Sbisa
@ 2020-09-02 16:47     ` Lorenzo Pieralisi
  2020-09-02 17:21       ` Catalin Marinas
  2020-09-02 23:07       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-02 16:47 UTC (permalink / raw)
  To: Clint Sbisa
  Cc: linux-pci, Bjorn Helgaas, benh, linux-arm-kernel, will, catalin.marinas

[+LAKML, Will, Catalin]

This is an ARM64 arch discussion so please keep the above CCed from
now onwards.

On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote:
> On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote:
> > 
> > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > > Using write-combine is crucial for performance of PCI devices where
> > > significant amounts of transactions go over PCI BARs.
> > 
> > Write-combine is an x86ism that means nothing on ARM64 platforms
> > so this should be rewritten to say what you actually mean, namely,
> > you want to allow prefetchable resources to be mapped with
> > "write combine" semantics (which means normal non-cacheable
> > memory on arm64) through proc/sysfs.
> > 
> > This is an outright can of worms and the PCI specs don't help in this
> > respect, since we may end up mapping resources that have read
> > side-effects with normal NC mappings (ie that's what "write combine" is
> > in arm64 - pgprot_writecombine() and that's speculative memory).
> > 
> > I am referring to "Additional Guidance on the Prefetchable Bit
> > in Memory Space BARs" in the PCI specifications - it does not make
> > any sense and must be removed because people use it to design
> > endpoints.
> > 
> > True - this is a problem even in kernel drivers but at least there
> > the ioremap_ semantics is in the driver and can be vetted.
> > 
> > This patch would make it user space ABI so I am a little nervous
> > about merging this code TBH.
> > 
> 
> User space applications are utilizing WC already. You can see DPDK code using
> resourceX_wc over the usual resourceX file at
> https://github.com/DPDK/dpdk/blob/main/drivers/bus/pci/linux/pci_uio.c#L312
> (commit https://github.com/DPDK/dpdk/commit/4a928ef9f61).

I know.

> Given that write-combine support was added in 2008 for x86 (and is
> also enabled for powerpc and ia64), I'm not sure if there'd be a
> downside to enabling it on arm64 as well given how prevalent it is.

I explained to you the reasons why this can have downsides and write
combine is a concept that does not exist in the ARM64 world, actually
it would be good if Ben can chime in to define how this works on power.

>Lorenzo, do you still have particular concerns about exposing this to
>userspace?

Yes I do and I expressed them.

The first concern is the WC ambiguity on non-x86 systems, it looks
like write combinining means everything and nothing at the same time
on != x86 arches.

On x86 prefetchable BAR == WC mapping (still conditional on arch
features ie PAT, not a blanket enable). On ARM64 WC mapping currently
corresponds to normal NC memory and the PCIe specs allow read
side-effects BAR to be marked as prefetchable, I need to force PCI sig
to remove the section I mentioned from the specifications because there
is NO way it can be detected if a prefetchable BAR maps to read
side-effects memory.

A kernel device driver would (hopefully) know, sysfs code that just
checks the prefetchable attribute and exports resource_WC does not.

As I mentioned, if the mapping is done in a device specific driver it
can be vetted and there are not many drivers mapping BARs as
ioremap_wc().

> I understand that my commit message is outright wrong after your explanation,
> so I'll rewrite that.
> 
> > > arm64 supports write-combine PCI mappings, so the appropriate define
> > > has been added which will expose write-combine mappings under sysfs
> > > for prefetchable PCI resources.
> > >
> > > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> > > ---
> > >  arch/arm64/include/asm/pci.h | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> > > index 70b323cf8300..b33ca260e3c9 100644
> > > --- a/arch/arm64/include/asm/pci.h
> > > +++ b/arch/arm64/include/asm/pci.h
> > > @@ -17,6 +17,7 @@
> > >  #define pcibios_assign_all_busses() \
> > >       (pci_has_flag(PCI_REASSIGN_ALL_BUS))
> > >
> > > +#define arch_can_pci_mmap_wc() 1
> > 
> > I am not comfortable with this blanket enable. Some existing drivers,
> > eg:
> > 
> > drivers/infiniband/hw/mlx5
> > 
> > use this macro to detect WC capability which again, it is x86 specific,
> > on arm64 it means nothing and can have consequences on the driver
> > operations.
> 
> If that driver is fixed to check what it actually wants to check, would that
> address your concern about the blanket enable? I don't see any other references
> to this in kernel drivers and I think the documentation at
> `filesystems/sysfs-pci.rst` outlines it pretty explicitly:
> 
>   Platforms which support write-combining maps of PCI resources must define
>   arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
>   write-combining is permitted.

That's exactly the problem. I am asking you: what does "write-combining
maps of PCI resources" mean ?

I understand we do want weak ordering for prefetchable BAR mappings
but my worry is that by exposing the resources as WC to user space
we are giving user space the impression that those mappings mirror
x86 WC mappings behaviour that is not true on ARM64.

Again - Ben has extensive experience on this on power I would be happy
to get his point of view before proceeding, it is important to undestand
how this work on non-x86 systems.

> It is otherwise only used by pci-sysfs to determine if a resourceX_wc
> file should be exposed.

I know - that's understood.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 16:47     ` Lorenzo Pieralisi
@ 2020-09-02 17:21       ` Catalin Marinas
  2020-09-02 17:54         ` Lorenzo Pieralisi
  2020-09-02 23:08         ` Benjamin Herrenschmidt
  2020-09-02 23:07       ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 59+ messages in thread
From: Catalin Marinas @ 2020-09-02 17:21 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, benh, linux-arm-kernel, will

On Wed, Sep 02, 2020 at 05:47:02PM +0100, Lorenzo Pieralisi wrote:
> On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote:
> > On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote:
> > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > > > arm64 supports write-combine PCI mappings, so the appropriate define
> > > > has been added which will expose write-combine mappings under sysfs
> > > > for prefetchable PCI resources.
> > > >
> > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> > > > ---
> > > >  arch/arm64/include/asm/pci.h | 1 +
> > > >  1 file changed, 1 insertion(+)
> > > >
> > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> > > > index 70b323cf8300..b33ca260e3c9 100644
> > > > --- a/arch/arm64/include/asm/pci.h
> > > > +++ b/arch/arm64/include/asm/pci.h
> > > > @@ -17,6 +17,7 @@
> > > >  #define pcibios_assign_all_busses() \
> > > >       (pci_has_flag(PCI_REASSIGN_ALL_BUS))
> > > >
> > > > +#define arch_can_pci_mmap_wc() 1
> > > 
> > > I am not comfortable with this blanket enable. Some existing drivers,
> > > eg:
> > > 
> > > drivers/infiniband/hw/mlx5
> > > 
> > > use this macro to detect WC capability which again, it is x86 specific,
> > > on arm64 it means nothing and can have consequences on the driver
> > > operations.
> > 
> > If that driver is fixed to check what it actually wants to check, would that
> > address your concern about the blanket enable? I don't see any other references
> > to this in kernel drivers and I think the documentation at
> > `filesystems/sysfs-pci.rst` outlines it pretty explicitly:
> > 
> >   Platforms which support write-combining maps of PCI resources must define
> >   arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
> >   write-combining is permitted.
> 
> That's exactly the problem. I am asking you: what does "write-combining
> maps of PCI resources" mean ?
> 
> I understand we do want weak ordering for prefetchable BAR mappings
> but my worry is that by exposing the resources as WC to user space
> we are giving user space the impression that those mappings mirror
> x86 WC mappings behaviour that is not true on ARM64.

Would Device_GRE be close to the x86 WC better? It won't allow unaligned
accesses and that can be problematic for the user. OTOH, it doesn't
speculate reads, so it's safer from the hardware perspective.

-- 
Catalin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 17:21       ` Catalin Marinas
@ 2020-09-02 17:54         ` Lorenzo Pieralisi
  2020-09-02 23:03           ` Benjamin Herrenschmidt
  2020-09-02 23:08         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-02 17:54 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, benh, linux-arm-kernel, will

On Wed, Sep 02, 2020 at 06:21:57PM +0100, Catalin Marinas wrote:
> On Wed, Sep 02, 2020 at 05:47:02PM +0100, Lorenzo Pieralisi wrote:
> > On Wed, Sep 02, 2020 at 02:29:22PM +0000, Clint Sbisa wrote:
> > > On Wed, Sep 02, 2020 at 12:32:07PM +0100, Lorenzo Pieralisi wrote:
> > > > On Mon, Aug 31, 2020 at 03:18:27PM +0000, Clint Sbisa wrote:
> > > > > arm64 supports write-combine PCI mappings, so the appropriate define
> > > > > has been added which will expose write-combine mappings under sysfs
> > > > > for prefetchable PCI resources.
> > > > >
> > > > > Signed-off-by: Clint Sbisa <csbisa@amazon.com>
> > > > > ---
> > > > >  arch/arm64/include/asm/pci.h | 1 +
> > > > >  1 file changed, 1 insertion(+)
> > > > >
> > > > > diff --git a/arch/arm64/include/asm/pci.h b/arch/arm64/include/asm/pci.h
> > > > > index 70b323cf8300..b33ca260e3c9 100644
> > > > > --- a/arch/arm64/include/asm/pci.h
> > > > > +++ b/arch/arm64/include/asm/pci.h
> > > > > @@ -17,6 +17,7 @@
> > > > >  #define pcibios_assign_all_busses() \
> > > > >       (pci_has_flag(PCI_REASSIGN_ALL_BUS))
> > > > >
> > > > > +#define arch_can_pci_mmap_wc() 1
> > > > 
> > > > I am not comfortable with this blanket enable. Some existing drivers,
> > > > eg:
> > > > 
> > > > drivers/infiniband/hw/mlx5
> > > > 
> > > > use this macro to detect WC capability which again, it is x86 specific,
> > > > on arm64 it means nothing and can have consequences on the driver
> > > > operations.
> > > 
> > > If that driver is fixed to check what it actually wants to check, would that
> > > address your concern about the blanket enable? I don't see any other references
> > > to this in kernel drivers and I think the documentation at
> > > `filesystems/sysfs-pci.rst` outlines it pretty explicitly:
> > > 
> > >   Platforms which support write-combining maps of PCI resources must define
> > >   arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
> > >   write-combining is permitted.
> > 
> > That's exactly the problem. I am asking you: what does "write-combining
> > maps of PCI resources" mean ?
> > 
> > I understand we do want weak ordering for prefetchable BAR mappings
> > but my worry is that by exposing the resources as WC to user space
> > we are giving user space the impression that those mappings mirror
> > x86 WC mappings behaviour that is not true on ARM64.
> 
> Would Device_GRE be close to the x86 WC better? It won't allow unaligned
> accesses and that can be problematic for the user. OTOH, it doesn't
> speculate reads, so it's safer from the hardware perspective.

Thanks Catalin for chiming in, it may yes but I need to figure out
the precise semantics of WC on x86 first.

Actually *if* I read x86 specs correctly WC mappings allow speculative
reads, which then would shift the issue on the PCI specs that allow
marking read side effects BARs as prefetchable; in other words if
an endpoint is designed with a prefetchable BAR that has read side
effects this is already an issue on x86 in the current kernel.

There is that, plus the usage of arch_can_pci_mmap_wc() in mellanox
drivers which I suspect it is yet another interpretation of x86 write
combine - I don't know what happens if we let arch_can_pci_mmap_wc() == 1
on both normalNC or deviceGRE mappings for pgprot_writecombine.

I think it is worth getting to the bottom of this before applying
this patch.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 17:54         ` Lorenzo Pieralisi
@ 2020-09-02 23:03           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-02 23:03 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Catalin Marinas
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will

On Wed, 2020-09-02 at 18:54 +0100, Lorenzo Pieralisi wrote:
> > > > If that driver is fixed to check what it actually wants to check, would that
> > > > address your concern about the blanket enable? I don't see any other references
> > > > to this in kernel drivers and I think the documentation at
> > > > `filesystems/sysfs-pci.rst` outlines it pretty explicitly:
> > > > 
> > > >    Platforms which support write-combining maps of PCI resources must define
> > > >    arch_can_pci_mmap_wc() which shall evaluate to non-zero at runtime when
> > > >    write-combining is permitted.
> > > 
> > > That's exactly the problem. I am asking you: what does "write-combining
> > > maps of PCI resources" mean ?
> > > 
> > > I understand we do want weak ordering for prefetchable BAR mappings
> > > but my worry is that by exposing the resources as WC to user space
> > > we are giving user space the impression that those mappings mirror
> > > x86 WC mappings behaviour that is not true on ARM64.
> > 
> > Would Device_GRE be close to the x86 WC better? It won't allow unaligned
> > accesses and that can be problematic for the user. OTOH, it doesn't
> > speculate reads, so it's safer from the hardware perspective.
> 
> Thanks Catalin for chiming in, it may yes but I need to figure out
> the precise semantics of WC on x86 first.

We never got to the bottom of that with powerpc... semantics of "WC"
are subtly different all over the archs. They key idea I think is for
us to state that a WC mapping drops all ordering guarantees :-)

That said, the goal here is to expose the sysfs _wc files, without
which, mapping of "no-side-effect" memory such as frame buffers etc...
produces something very very slow.

> Actually *if* I read x86 specs correctly WC mappings allow speculative
> reads, which then would shift the issue on the PCI specs that allow
> marking read side effects BARs as prefetchable;

Yes.

>  in other words if
> an endpoint is designed with a prefetchable BAR that has read side
> effects this is already an issue on x86 in the current kernel.

An powerpc. We remove the "G" bit. Same deal.

> There is that, plus the usage of arch_can_pci_mmap_wc() in mellanox
> drivers which I suspect it is yet another interpretation of x86 write
> combine - I don't know what happens if we let arch_can_pci_mmap_wc() == 1
> on both normalNC or deviceGRE mappings for pgprot_writecombine.
> 
> I think it is worth getting to the bottom of this before applying
> this patch.

I think it basically boils down to mapping things without side effect
and ordering guarantees but that still cannot be cached.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 16:47     ` Lorenzo Pieralisi
  2020-09-02 17:21       ` Catalin Marinas
@ 2020-09-02 23:07       ` Benjamin Herrenschmidt
  2020-09-03 11:08         ` Lorenzo Pieralisi
  1 sibling, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-02 23:07 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Clint Sbisa
  Cc: linux-pci, Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas

On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote:
> Yes I do and I expressed them.
> 
> The first concern is the WC ambiguity on non-x86 systems, it looks
> like write combinining means everything and nothing at the same time
> on != x86 arches.
> 
> On x86 prefetchable BAR == WC mapping (still conditional on arch
> features ie PAT, not a blanket enable). On ARM64 WC mapping currently
> corresponds to normal NC memory and the PCIe specs allow read
> side-effects BAR to be marked as prefetchable, I need to force PCI
> sig
> to remove the section I mentioned from the specifications because
> there
> is NO way it can be detected if a prefetchable BAR maps to read
> side-effects memory.

Im not sure I understand your sentence. It's been a long accepted rule
in PCI land that "prefetchable" BARs means "no side effects" and in
fact allows much more than just prefetching :-)

> A kernel device driver would (hopefully) know, sysfs code that just
> checks the prefetchable attribute and exports resource_WC does not.
>
> As I mentioned, if the mapping is done in a device specific driver it
> can be vetted and there are not many drivers mapping BARs as
> ioremap_wc().

It's been what other architectures have been doing for mroe than a
decade without significant issues... I don't think you should worry too
much about this.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 17:21       ` Catalin Marinas
  2020-09-02 17:54         ` Lorenzo Pieralisi
@ 2020-09-02 23:08         ` Benjamin Herrenschmidt
  2020-09-02 23:08           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-02 23:08 UTC (permalink / raw)
  To: Catalin Marinas, Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will

On Wed, 2020-09-02 at 18:21 +0100, Catalin Marinas wrote:
> > I understand we do want weak ordering for prefetchable BAR mappings
> > but my worry is that by exposing the resources as WC to user space
> > we are giving user space the impression that those mappings mirror
> > x86 WC mappings behaviour that is not true on ARM64.
> 
> Would Device_GRE be close to the x86 WC better? It won't allow
> unaligned
> accesses and that can be problematic for the user. OTOH, it doesn't
> speculate reads, so it's safer from the hardware perspective.

Its accepted generally that prefetchable BARs can allow speculative
accesses, write combining, re-ordering even etc... and it's also
commonly be the target of unaligned accesses.

In reality, in PCI land, it really means "no side effects".

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 23:08         ` Benjamin Herrenschmidt
@ 2020-09-02 23:08           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-02 23:08 UTC (permalink / raw)
  To: Catalin Marinas, Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will

On Thu, 2020-09-03 at 09:08 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2020-09-02 at 18:21 +0100, Catalin Marinas wrote:
> > > I understand we do want weak ordering for prefetchable BAR
> > > mappings
> > > but my worry is that by exposing the resources as WC to user
> > > space
> > > we are giving user space the impression that those mappings
> > > mirror
> > > x86 WC mappings behaviour that is not true on ARM64.
> > 
> > Would Device_GRE be close to the x86 WC better? It won't allow
> > unaligned
> > accesses and that can be problematic for the user. OTOH, it doesn't
> > speculate reads, so it's safer from the hardware perspective.
> 
> Its accepted generally that prefetchable BARs can allow speculative
> accesses, write combining, re-ordering even etc... and it's also
> commonly be the target of unaligned accesses.
> 
> In reality, in PCI land, it really means "no side effects".

Correction: No *read* side effects.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-02 23:07       ` Benjamin Herrenschmidt
@ 2020-09-03 11:08         ` Lorenzo Pieralisi
  2020-09-03 14:36           ` Clint Sbisa
                             ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-03 11:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas

On Thu, Sep 03, 2020 at 09:07:00AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote:
> > Yes I do and I expressed them.
> > 
> > The first concern is the WC ambiguity on non-x86 systems, it looks
> > like write combinining means everything and nothing at the same time
> > on != x86 arches.
> > 
> > On x86 prefetchable BAR == WC mapping (still conditional on arch
> > features ie PAT, not a blanket enable). On ARM64 WC mapping currently
> > corresponds to normal NC memory and the PCIe specs allow read
> > side-effects BAR to be marked as prefetchable, I need to force PCI
> > sig
> > to remove the section I mentioned from the specifications because
> > there
> > is NO way it can be detected if a prefetchable BAR maps to read
> > side-effects memory.
> 
> Im not sure I understand your sentence. It's been a long accepted rule
> in PCI land that "prefetchable" BARs means "no side effects" and in
> fact allows much more than just prefetching :-)

I am referring to the nefarious:

"Additional Guidance on the Prefetchable Bit in Memory Space BARs"

I read it 100 times and I still have no idea how it can be implemented,
it sorts of acknowledges that read side-effects memory can be marked
as a prefetchable BAR *if* the system meets some criteria.

As if endpoint designers knew the system where their endpoint is
plugged into (+ bit (3) in a BAR is read-only).

I think that that implementation note must be removed from the
specifications - if anyone dares to follow it this whole
WC resource mapping can trigger trouble.

Good news is that it would be trouble for all arches :)

> > A kernel device driver would (hopefully) know, sysfs code that just
> > checks the prefetchable attribute and exports resource_WC does not.
> >
> > As I mentioned, if the mapping is done in a device specific driver it
> > can be vetted and there are not many drivers mapping BARs as
> > ioremap_wc().
> 
> It's been what other architectures have been doing for mroe than a
> decade without significant issues... I don't think you should worry too
> much about this.

Minus what I wrote above, I agree with you. I'd still be able to
understand what this patch changes in the mellanox driver HW
handling though - not sure what they expect from arch_can_pci_mmap_wc()
returning 1.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-03 11:08         ` Lorenzo Pieralisi
@ 2020-09-03 14:36           ` Clint Sbisa
  2020-09-03 22:26           ` Benjamin Herrenschmidt
  2020-09-07 23:33           ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 59+ messages in thread
From: Clint Sbisa @ 2020-09-03 14:36 UTC (permalink / raw)
  To: Lorenzo Pieralisi, Yishai Hadas, Guy Levy
  Cc: Benjamin Herrenschmidt, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas

Adding some Mellanox folks to comment on their usage of arch_can_pci_mmap_wc().

On Thu, Sep 03, 2020 at 12:08:44PM +0100, Lorenzo Pieralisi wrote:
> On Thu, Sep 03, 2020 at 09:07:00AM +1000, Benjamin Herrenschmidt wrote:
> > On Wed, 2020-09-02 at 17:47 +0100, Lorenzo Pieralisi wrote:
> > > Yes I do and I expressed them.
> > >
> > > The first concern is the WC ambiguity on non-x86 systems, it looks
> > > like write combinining means everything and nothing at the same time
> > > on != x86 arches.
> > >
> > > On x86 prefetchable BAR == WC mapping (still conditional on arch
> > > features ie PAT, not a blanket enable). On ARM64 WC mapping currently
> > > corresponds to normal NC memory and the PCIe specs allow read
> > > side-effects BAR to be marked as prefetchable, I need to force PCI
> > > sig
> > > to remove the section I mentioned from the specifications because
> > > there
> > > is NO way it can be detected if a prefetchable BAR maps to read
> > > side-effects memory.
> >
> > Im not sure I understand your sentence. It's been a long accepted rule
> > in PCI land that "prefetchable" BARs means "no side effects" and in
> > fact allows much more than just prefetching :-)
> 
> I am referring to the nefarious:
> 
> "Additional Guidance on the Prefetchable Bit in Memory Space BARs"
> 
> I read it 100 times and I still have no idea how it can be implemented,
> it sorts of acknowledges that read side-effects memory can be marked
> as a prefetchable BAR *if* the system meets some criteria.
> 
> As if endpoint designers knew the system where their endpoint is
> plugged into (+ bit (3) in a BAR is read-only).
> 
> I think that that implementation note must be removed from the
> specifications - if anyone dares to follow it this whole
> WC resource mapping can trigger trouble.
> 
> Good news is that it would be trouble for all arches :)
> 
> > > A kernel device driver would (hopefully) know, sysfs code that just
> > > checks the prefetchable attribute and exports resource_WC does not.
> > >
> > > As I mentioned, if the mapping is done in a device specific driver it
> > > can be vetted and there are not many drivers mapping BARs as
> > > ioremap_wc().
> >
> > It's been what other architectures have been doing for mroe than a
> > decade without significant issues... I don't think you should worry too
> > much about this.
> 
> Minus what I wrote above, I agree with you. I'd still be able to
> understand what this patch changes in the mellanox driver HW
> handling though - not sure what they expect from arch_can_pci_mmap_wc()
> returning 1.

This seems to have been added broadly for x86, PPC, and ARM as part of initial
WC support in the driver (37aa5c36 "IB/mlx5: Add UARs write-combining and
non-cached mapping"). It was updated to use `arch_can_pci_mmap_wc()` later
(1f3db161 "IB/mlx5: Generally use the WC auto detection test result").

Guy, Yishai, there are some concerns about difference the technical definition
of WC and how WC is actually used. Can you comment on the usage of WC in mlx5
and which definition of WC the driver utilizes? We're unsure if a blanket
enable for arm64 is safe in light of the driver's use of this function.

Thanks,
Clint

> 
> Thanks,
> Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-03 11:08         ` Lorenzo Pieralisi
  2020-09-03 14:36           ` Clint Sbisa
@ 2020-09-03 22:26           ` Benjamin Herrenschmidt
  2020-09-07 23:33           ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-03 22:26 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas

On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> "Additional Guidance on the Prefetchable Bit in Memory Space BARs"
> 
> I read it 100 times and I still have no idea how it can be
> implemented,
> it sorts of acknowledges that read side-effects memory can be marked
> as a prefetchable BAR *if* the system meets some criteria.
> 
> As if endpoint designers knew the system where their endpoint is
> plugged into (+ bit (3) in a BAR is read-only).
> 
> I think that that implementation note must be removed from the
> specifications - if anyone dares to follow it this whole
> WC resource mapping can trigger trouble.

Ah that one ! Yes you are right its completely broken.

This part of the spec aims at working around the fact that bridges only
have 64-bit prefetchable windows, so anything non-pref has to go below
a 32-bit bridge window (effectively making most 64-bit non-pref BARs a
pointless waste of silicon).

The right fix of course would have been to create a new type of bridge
window. But PCI...

If you're going to mess around with the SIG, I would suggest that a
better approach short of the above would be to allow system software to
put 64-bit non-pref BARs below bridge pref windows on PCIe (provided
the various otehr restrictions in that note are honored such as bridges
not prefetching) and leave it at that. (Unless they already do
somewhere else, I forgot ...).

This should be sufficient to address the space concern without killing
the meaning of the prefetchable bit.

As for enabling the _wc files in sysfs, well, you need some serious
priviledge to be able to access them, so I don't see a big issue 
allowing them to exist when "prefetchable" is set regardless of that
rule. The Mellanox case might be different.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-03 11:08         ` Lorenzo Pieralisi
  2020-09-03 14:36           ` Clint Sbisa
  2020-09-03 22:26           ` Benjamin Herrenschmidt
@ 2020-09-07 23:33           ` Benjamin Herrenschmidt
  2020-09-10  9:46             ` Lorenzo Pieralisi
  2 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-07 23:33 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas, Leon Romanovsky

On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > It's been what other architectures have been doing for mroe than a
> > decade without significant issues... I don't think you should worry
> > too
> > much about this.
> 
> Minus what I wrote above, I agree with you. I'd still be able to
> understand what this patch changes in the mellanox driver HW
> handling though - not sure what they expect from
> arch_can_pci_mmap_wc()
> returning 1.

I don't know enough to get into the finer details but looking a bit it
seems when this is set, they allow extra ioctls to create buffers
mapped with pgprot_writecombine().

I suppose this means faster MMIO backet buffers for small packets (ie,
non-DMA use case).

Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
non-ROCE ethernet port on a PF... For anyting else, it just seems to
actually try to do it and see what happens :-)

Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
whether you see an issue with enabling this on arm64 ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-07 23:33           ` Benjamin Herrenschmidt
@ 2020-09-10  9:46             ` Lorenzo Pieralisi
  2020-09-10 10:54               ` Leon Romanovsky
  2020-09-10 12:37               ` Jason Gunthorpe
  0 siblings, 2 replies; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-10  9:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, jgg
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas, Leon Romanovsky

[+Jason]

On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > It's been what other architectures have been doing for mroe than a
> > > decade without significant issues... I don't think you should worry
> > > too
> > > much about this.
> > 
> > Minus what I wrote above, I agree with you. I'd still be able to
> > understand what this patch changes in the mellanox driver HW
> > handling though - not sure what they expect from
> > arch_can_pci_mmap_wc()
> > returning 1.
> 
> I don't know enough to get into the finer details but looking a bit it
> seems when this is set, they allow extra ioctls to create buffers
> mapped with pgprot_writecombine().
> 
> I suppose this means faster MMIO backet buffers for small packets (ie,
> non-DMA use case).
> 
> Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> non-ROCE ethernet port on a PF... For anyting else, it just seems to
> actually try to do it and see what happens :-)
> 
> Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> whether you see an issue with enabling this on arm64 ?

Hi Jason,

I was wondering if you could help us with this question, we are trying
to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
in mellanox drivers wrt mappings and whether there is an expected
behaviour behind them, in particular whether there is an implicit
reliance on x86 write-combine arch/interconnect details.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10  9:46             ` Lorenzo Pieralisi
@ 2020-09-10 10:54               ` Leon Romanovsky
  2020-09-10 12:37               ` Jason Gunthorpe
  1 sibling, 0 replies; 59+ messages in thread
From: Leon Romanovsky @ 2020-09-10 10:54 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, jgg, Clint Sbisa, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas

On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote:
> [+Jason]
>
> On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > > It's been what other architectures have been doing for mroe than a
> > > > decade without significant issues... I don't think you should worry
> > > > too
> > > > much about this.
> > >
> > > Minus what I wrote above, I agree with you. I'd still be able to
> > > understand what this patch changes in the mellanox driver HW
> > > handling though - not sure what they expect from
> > > arch_can_pci_mmap_wc()
> > > returning 1.
> >
> > I don't know enough to get into the finer details but looking a bit it
> > seems when this is set, they allow extra ioctls to create buffers
> > mapped with pgprot_writecombine().
> >
> > I suppose this means faster MMIO backet buffers for small packets (ie,
> > non-DMA use case).
> >
> > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> > non-ROCE ethernet port on a PF... For anyting else, it just seems to
> > actually try to do it and see what happens :-)
> >
> > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> > whether you see an issue with enabling this on arm64 ?
>
> Hi Jason,
>
> I was wondering if you could help us with this question, we are trying
> to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
> in mellanox drivers wrt mappings and whether there is an expected
> behaviour behind them, in particular whether there is an implicit
> reliance on x86 write-combine arch/interconnect details.

Sorry, somehow I missed this email.

The arch_can_pci_mmap_wc() used in IB representors, special mode where
we can't perform write-combine test below.

The commit 1f3db161881b ("IB/mlx5: Generally use the WC auto detection test result")
describes it.

I don't see any problem with enabling arch_can_pci_mmap_wc() on ARM.

Thanks

>
> Thanks,
> Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10  9:46             ` Lorenzo Pieralisi
  2020-09-10 10:54               ` Leon Romanovsky
@ 2020-09-10 12:37               ` Jason Gunthorpe
  2020-09-10 15:17                 ` Lorenzo Pieralisi
  1 sibling, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 12:37 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote:
> [+Jason]
> 
> On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > > It's been what other architectures have been doing for mroe than a
> > > > decade without significant issues... I don't think you should worry
> > > > too
> > > > much about this.
> > > 
> > > Minus what I wrote above, I agree with you. I'd still be able to
> > > understand what this patch changes in the mellanox driver HW
> > > handling though - not sure what they expect from
> > > arch_can_pci_mmap_wc()
> > > returning 1.
> > 
> > I don't know enough to get into the finer details but looking a bit it
> > seems when this is set, they allow extra ioctls to create buffers
> > mapped with pgprot_writecombine().
> > 
> > I suppose this means faster MMIO backet buffers for small packets (ie,
> > non-DMA use case).
> > 
> > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> > non-ROCE ethernet port on a PF... For anyting else, it just seems to
> > actually try to do it and see what happens :-)
> > 
> > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> > whether you see an issue with enabling this on arm64 ?
> 
> Hi Jason,
> 
> I was wondering if you could help us with this question, we are trying
> to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
> in mellanox drivers wrt mappings and whether there is an expected
> behaviour behind them, in particular whether there is an implicit
> reliance on x86 write-combine arch/interconnect details.

Looking back at this big thread, let me add some perspective

Mellanox drivers have a performance optimization where a 64 byte MemWr
TLP from the root complex to the MMIO BAR will perform better, often
quite a bit better. We run WC in full QA'd production on PPC, ARM and
x86.

The userspace generates a burst of sequential, aligned 8 byte CPU
writes to the MMIO address and triggers an arch specific CPU barrier
to flush/fence the CPU WC buffer. At this point the CPU should emit
the 64 byte TLP toward the device ASAP.

In other words, the only usage here is only about Write. The CPU
should never, ever, generate a MemRD TLP. The code never does a read
explicitly.

If the CPU fails to generate a 64 byte TLP then the device will still
operate correctly but does a different, slower, flow.

If the CPU consistently fails WC then the overhead of trying the WC
flow is a notable net performance loss, and on these CPUs we want to
use only 8 byte write to the MMIO BAR, with NC memory.

There are many important details about how this works and how this
must interact with the CPU barriers and locking.

On x86, arch_can_pci_mmap_wc() is basically meaningless. It indicates
there is a chance that pgprot_writecombine() could work. It can also
be 0 and write combining will work just fine :\.

Thus, mlx5 switched to doing a runtime WC test to determine if the CPU
actually supports WC or not. If the arch can reliably tell the driver
then this test could be avoided. Based on this test the WC mode is
allowed for userspace.

The one call to arch_can_pci_mmap_wc() is in a case where the HW is
configured in a way that can't run the test, here we use
arch_can_pci_mmap_wc() to guess if the CPU has working WC or not.
Ideally an arch would return 1 only when the CPU has working WC.

Depending on workload WC may not be a win. In those cases userspace
will select NC. Thus the same PCI MMIO BAR region can have a mixture
of pages with WC and NC mappings to userspace.

For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices
are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This
seems to be the preferred working configuration on at least some ARM
SOCs. So far nobody from the ARM world has shown interest in making a
mainline solution. :(

I can't recall if this is because the relevant ARM SOC's don't support
pgprot_writecombine(), or it doesn't work properly.

I was told the reason ARM never enabled WC was because unaligned
access to WC memory was not supported, and there were existing drivers
that did unaligned writes that would malfunction. I thought this meant
that pgprot_writecombine() was non-working in ARM Linux? So, bit
surprised to see a patch messing with arch_can_pci_mmap_wc() and not
changing the defintion of pgprot_writecombine() ?

mlx5 is more or less a representative user WC for this kind of
'message push' methodology. Several other RDMA devices do this as
well. The methodology is important enough that recent Intel CPUs have
a dedicated instruction to push a 128 byte message in a single TLP
avoiding this whole WC mess.

Frankly, I think the kernel should introduce a well defined pgprot for
this working mode that all archs can agree upon. It should include the
alignment requirement, message push function, CPU barrier macros, and
locking macros that are needed to use this facility correctly.

Defined in a way that is compatible with DEVICE_GRE and can be used by
these 'message push' drivers. That would switch alway most of the
users in the kernel today.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10 12:37               ` Jason Gunthorpe
@ 2020-09-10 15:17                 ` Lorenzo Pieralisi
  2020-09-10 17:10                   ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-10 15:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 10, 2020 at 09:37:58AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 10, 2020 at 10:46:00AM +0100, Lorenzo Pieralisi wrote:
> > [+Jason]
> > 
> > On Tue, Sep 08, 2020 at 09:33:42AM +1000, Benjamin Herrenschmidt wrote:
> > > On Thu, 2020-09-03 at 12:08 +0100, Lorenzo Pieralisi wrote:
> > > > > It's been what other architectures have been doing for mroe than a
> > > > > decade without significant issues... I don't think you should worry
> > > > > too
> > > > > much about this.
> > > > 
> > > > Minus what I wrote above, I agree with you. I'd still be able to
> > > > understand what this patch changes in the mellanox driver HW
> > > > handling though - not sure what they expect from
> > > > arch_can_pci_mmap_wc()
> > > > returning 1.
> > > 
> > > I don't know enough to get into the finer details but looking a bit it
> > > seems when this is set, they allow extra ioctls to create buffers
> > > mapped with pgprot_writecombine().
> > > 
> > > I suppose this means faster MMIO backet buffers for small packets (ie,
> > > non-DMA use case).
> > > 
> > > Also note that mlx5_ib_test_wc() only uses arch_can_pci_mmap_wc() for a
> > > non-ROCE ethernet port on a PF... For anyting else, it just seems to
> > > actually try to do it and see what happens :-)
> > > 
> > > Leon: Can you clarify the use of arch_can_pci_mmap_wc() in mlx5 and
> > > whether you see an issue with enabling this on arm64 ?
> > 
> > Hi Jason,
> > 
> > I was wondering if you could help us with this question, we are trying
> > to understand what enabling arch_can_pci_mmap_wc() on arm64 would cause
> > in mellanox drivers wrt mappings and whether there is an expected
> > behaviour behind them, in particular whether there is an implicit
> > reliance on x86 write-combine arch/interconnect details.
> 
> Looking back at this big thread, let me add some perspective

Thank you - it was needed.

> Mellanox drivers have a performance optimization where a 64 byte MemWr
> TLP from the root complex to the MMIO BAR will perform better, often
> quite a bit better. We run WC in full QA'd production on PPC, ARM and
> x86.
> 
> The userspace generates a burst of sequential, aligned 8 byte CPU
> writes to the MMIO address and triggers an arch specific CPU barrier
> to flush/fence the CPU WC buffer. At this point the CPU should emit
> the 64 byte TLP toward the device ASAP.

While at it - mind explaining please what those 64 bytes actully contain ?

> In other words, the only usage here is only about Write. The CPU
> should never, ever, generate a MemRD TLP. The code never does a read
> explicitly.

On arm64 pgprot_writecombine() is speculative memory (normal
non-cacheable), which may not do what you expect from it.

> If the CPU fails to generate a 64 byte TLP then the device will still
> operate correctly but does a different, slower, flow.

Side note: on ARM that TLP is not a native interconnect transaction,
reworded, it depends on what the system-bus->PCI logic does in
this respect.

> If the CPU consistently fails WC then the overhead of trying the WC
> flow is a notable net performance loss, and on these CPUs we want to
> use only 8 byte write to the MMIO BAR, with NC memory.

That's why I looped you in - that's what worries me about "enabling"
arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
regressions that's not OK.

Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
driver (or more broadly all drivers following this message push
semantics) to use "something else" for WC detection.

> There are many important details about how this works and how this
> must interact with the CPU barriers and locking.
> 
> On x86, arch_can_pci_mmap_wc() is basically meaningless.

On arm64 too, for the records - or better, write-combine is not
well defined, ergo I don't know what arch_can_pci_mmap_wc() means.

> It indicates there is a chance that pgprot_writecombine() could work.
> It can also be 0 and write combining will work just fine :\.
> 
> Thus, mlx5 switched to doing a runtime WC test to determine if the CPU
> actually supports WC or not. If the arch can reliably tell the driver
> then this test could be avoided. Based on this test the WC mode is
> allowed for userspace.

Can you elaborate on this runtime test please ?

> The one call to arch_can_pci_mmap_wc() is in a case where the HW is
> configured in a way that can't run the test, here we use
> arch_can_pci_mmap_wc() to guess if the CPU has working WC or not.
> Ideally an arch would return 1 only when the CPU has working WC.

Which means we can guarantee the TLP packet you mentioned above I
guess ?

We have to define "working WC" :)

> Depending on workload WC may not be a win. In those cases userspace
> will select NC. Thus the same PCI MMIO BAR region can have a mixture
> of pages with WC and NC mappings to userspace.
> 
> For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices
> are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This
> seems to be the preferred working configuration on at least some ARM
> SOCs. So far nobody from the ARM world has shown interest in making a
> mainline solution. :(
> 
> I can't recall if this is because the relevant ARM SOC's don't support
> pgprot_writecombine(), or it doesn't work properly.
> 
> I was told the reason ARM never enabled WC was because unaligned

When you say "enabled WC" I assume you mean making:

pgprot_writecombine() == DEVICE_GRE

> access to WC memory was not supported, and there were existing drivers
> that did unaligned writes that would malfunction. I thought this meant
> that pgprot_writecombine() was non-working in ARM Linux?

On arm64 pgprot_writecombine() is normal non-cacheable memory at the
moment - it works but that does not precisely do what you *expect* from
arch_can_pci_mmap_wc(), that's the whole point I am making.

> So, bit surprised to see a patch messing with arch_can_pci_mmap_wc()
> and not changing the defintion of pgprot_writecombine() ?

We can't change pgprot_writecombine() to DEVICE_GRE, it can trigger
issues on some drivers, see unaligned memory access.

> mlx5 is more or less a representative user WC for this kind of
> 'message push' methodology. Several other RDMA devices do this as
> well. The methodology is important enough that recent Intel CPUs have
> a dedicated instruction to push a 128 byte message in a single TLP
> avoiding this whole WC mess.
> 
> Frankly, I think the kernel should introduce a well defined pgprot for
> this working mode that all archs can agree upon. It should include the
> alignment requirement, message push function, CPU barrier macros, and
> locking macros that are needed to use this facility correctly.
> 
> Defined in a way that is compatible with DEVICE_GRE and can be used by
> these 'message push' drivers. That would switch alway most of the
> users in the kernel today.

That's probably the way forward - I still have concerns about this
patch as it stands given your clarifications above.

Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10 15:17                 ` Lorenzo Pieralisi
@ 2020-09-10 17:10                   ` Jason Gunthorpe
  2020-09-10 21:46                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 17:10 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 10, 2020 at 04:17:21PM +0100, Lorenzo Pieralisi wrote:

> > In other words, the only usage here is only about Write. The CPU
> > should never, ever, generate a MemRD TLP. The code never does a read
> > explicitly.
> 
> On arm64 pgprot_writecombine() is speculative memory (normal
> non-cacheable), which may not do what you expect from it.

Can you explain what this actually does on ARM? 

Can it ever speculate loads across page boundaries, or speculate loads
that never exist in the program? ie will we get random unpredicable
MemRds?

Does it/could it "combine writes"?

> > If the CPU fails to generate a 64 byte TLP then the device will still
> > operate correctly but does a different, slower, flow.
> 
> Side note: on ARM that TLP is not a native interconnect transaction,
> reworded, it depends on what the system-bus->PCI logic does in
> this respect.

I think the issue is that ARM never defined what the bits set by
pgprot_writecombine() do at a system level so we see implementations
that do not cause write combining at the PCI-E interface for those
bits. (I assume from what I've heard)

> That's why I looped you in - that's what worries me about "enabling"
> arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
> regressions that's not OK.
> 
> Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
> driver (or more broadly all drivers following this message push
> semantics) to use "something else" for WC detection.

arch_can_pci_mmap_wc() really only controls the sysfs resource file
and it seems very unclear who in userspace uses that these days.

vfio is now the right way to do that stuff. I don't see an obvious way
to get WC memory in VFIO though...

So, I don't think this patch will break anything, however I also don't
see a point to it as nobody should be using the sys/resource files
these days.. Curios why Clint proposed it?

> > Thus, mlx5 switched to doing a runtime WC test to determine if the CPU
> > actually supports WC or not. If the arch can reliably tell the driver
> > then this test could be avoided. Based on this test the WC mode is
> > allowed for userspace.
> 
> Can you elaborate on this runtime test please ?

mlx5 is a network device, so the purpose of the 64 byte TLP is to
push, say, a network send command to the device fully formed (eg with
the DMA list, etc). Otherwise we can only push an indication to read
the 64 byte command via DMA from the command ring. 

Avoiding the extra DMA is a performance win.

The runtime test pushes a 64 byte command to the device. If it arrives
as a single TLP then the device executes that command, otherwise it
triggers DMA and reads the 64 bytes from host memory. The test
arranges that the command pushed via the TLP and the command fetched
via DMA are different. The test can then determine which path the
device took and thus know directly if the implementation can deliver
the required TLP or not.

> > configured in a way that can't run the test, here we use
> > arch_can_pci_mmap_wc() to guess if the CPU has working WC or not.
> > Ideally an arch would return 1 only when the CPU has working WC.
> 
> Which means we can guarantee the TLP packet you mentioned above I
> guess ?

Yes. For a PCI driver I think this is the only thing that matters for
"write combining". mlx5 is pretty strict, other drivers might gain
advantage from smaller or more fragmented transfers, eg 32 byte chunks
or something.

> > Depending on workload WC may not be a win. In those cases userspace
> > will select NC. Thus the same PCI MMIO BAR region can have a mixture
> > of pages with WC and NC mappings to userspace.
> > 
> > For DEVICE_GRE.. For years now, many deployments of ARM & mlx5 devices
> > are using an out of tree patch to use DEVICE_GRE for WC on mlx5. This
> > seems to be the preferred working configuration on at least some ARM
> > SOCs. So far nobody from the ARM world has shown interest in making a
> > mainline solution. :(
> > 
> > I can't recall if this is because the relevant ARM SOC's don't support
> > pgprot_writecombine(), or it doesn't work properly.
> > 
> > I was told the reason ARM never enabled WC was because unaligned
> 
> When you say "enabled WC" I assume you mean making:
> 
> pgprot_writecombine() == DEVICE_GRE

Well, when I say "enabled WC" I mean it seems existing
pgprot_writecombine() does not give write combining at PCI-E and the
option that does give write combining is DEVICE_GRE, that has the
alignment problem and can't be used.

At least on some SOCs.

> > access to WC memory was not supported, and there were existing drivers
> > that did unaligned writes that would malfunction. I thought this meant
> > that pgprot_writecombine() was non-working in ARM Linux?
> 
> On arm64 pgprot_writecombine() is normal non-cacheable memory at the
> moment - it works but that does not precisely do what you *expect* from
> arch_can_pci_mmap_wc(), that's the whole point I am making.

Most places use pgprot_writecombine() blindly and just ignore
arch_can_pci_mmap_wc() - I suppose with the idea that
pgprot_writecombine() will be a harmless NOP if it isn't supported?

mlx5 is possibly the only place where someone actually tested and
cared about the performance difference between using a WC specific
algorithm or not.

> > mlx5 is more or less a representative user WC for this kind of
> > 'message push' methodology. Several other RDMA devices do this as
> > well. The methodology is important enough that recent Intel CPUs have
> > a dedicated instruction to push a 128 byte message in a single TLP
> > avoiding this whole WC mess.
> > 
> > Frankly, I think the kernel should introduce a well defined pgprot for
> > this working mode that all archs can agree upon. It should include the
> > alignment requirement, message push function, CPU barrier macros, and
> > locking macros that are needed to use this facility correctly.
> > 
> > Defined in a way that is compatible with DEVICE_GRE and can be used by
> > these 'message push' drivers. That would switch alway most of the
> > users in the kernel today.
> 
> That's probably the way forward - I still have concerns about this
> patch as it stands given your clarifications above.

I would very much like to see some support for DEVICE_GRE (at least on
the SOCs that require it) in ARM going forward. For whatever reason
this seems to be necessary to get write PCI combining behavior on
enough ARM hardware that it should be supported in the upstream
kernel.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10 17:10                   ` Jason Gunthorpe
@ 2020-09-10 21:46                     ` Benjamin Herrenschmidt
  2020-09-10 23:29                       ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-10 21:46 UTC (permalink / raw)
  To: Jason Gunthorpe, Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas, Leon Romanovsky

On Thu, 2020-09-10 at 14:10 -0300, Jason Gunthorpe wrote:
> Can you explain what this actually does on ARM? 
> 
> Can it ever speculate loads across page boundaries, or speculate
> loads
> that never exist in the program? ie will we get random unpredicable
> MemRds?

Probably, at least on powerpc you will as well, that's the only way to
get write combine.

> Does it/could it "combine writes"?

I assume so for ARM, definitely for powerpc.

> > > If the CPU fails to generate a 64 byte TLP then the device will
> > > still
> > > operate correctly but does a different, slower, flow.
> > 
> > Side note: on ARM that TLP is not a native interconnect
> > transaction,
> > reworded, it depends on what the system-bus->PCI logic does in
> > this respect.
> 
> I think the issue is that ARM never defined what the bits set by
> pgprot_writecombine() do at a system level so we see implementations
> that do not cause write combining at the PCI-E interface for those
> bits. (I assume from what I've heard)

Nobody did. I think only x86 has a real "write combine" attribute. I
tried to untangled that mess years ago and didnt' get to the bottom of
it, but basically, on non-x86 archs, pgprot_writecombine will give you
what you asked ... and more.

> > That's why I looped you in - that's what worries me about
> > "enabling"
> > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
> > regressions that's not OK.
> > 
> > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
> > driver (or more broadly all drivers following this message push
> > semantics) to use "something else" for WC detection.
> 
> arch_can_pci_mmap_wc() really only controls the sysfs resource file
> and it seems very unclear who in userspace uses that these days.

dpdk under some circumstances afaik.

> vfio is now the right way to do that stuff. I don't see an obvious
> way to get WC memory in VFIO though...

Which would be a performance issue on a number of things I suppose...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10 21:46                     ` Benjamin Herrenschmidt
@ 2020-09-10 23:29                       ` Jason Gunthorpe
  2020-09-11  0:39                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-10 23:29 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Fri, Sep 11, 2020 at 07:46:47AM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2020-09-10 at 14:10 -0300, Jason Gunthorpe wrote:
> > Can you explain what this actually does on ARM? 
> > 
> > Can it ever speculate loads across page boundaries, or speculate
> > loads
> > that never exist in the program? ie will we get random unpredicable
> > MemRds?
> 
> Probably, at least on powerpc you will as well, that's the only way to
> get write combine.

If I remove the PROT_READ in the user space mmap will it block it?

Read TLPs are not harmful but I suspect they would cause an
undesirable random performance anomaly.

> > Does it/could it "combine writes"?
> 
> I assume so for ARM, definitely for powerpc.

Various IBM PPC chips I know work, we do test that.

> > > That's why I looped you in - that's what worries me about
> > > "enabling"
> > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
> > > regressions that's not OK.
> > > 
> > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
> > > driver (or more broadly all drivers following this message push
> > > semantics) to use "something else" for WC detection.
> > 
> > arch_can_pci_mmap_wc() really only controls the sysfs resource file
> > and it seems very unclear who in userspace uses that these days.
> 
> dpdk under some circumstances afaik.

And something gross for DMA then? Not sure dpdk is useful without
DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing?
 
> > vfio is now the right way to do that stuff. I don't see an obvious
> > way to get WC memory in VFIO though...
> 
> Which would be a performance issue on a number of things I suppose...

Almost nothing uses pci_iomap_wc(), so I'd be surpried if userspace
DPDK was an important user when an in-kernel driver for the same HW
doesn't use it?

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-10 23:29                       ` Jason Gunthorpe
@ 2020-09-11  0:39                         ` Benjamin Herrenschmidt
  2020-09-11 14:21                           ` Jason Gunthorpe
  2020-09-11 21:42                           ` Clint Sbisa
  0 siblings, 2 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-11  0:39 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, 2020-09-10 at 20:29 -0300, Jason Gunthorpe wrote:
> > Probably, at least on powerpc you will as well, that's the only way to
> > get write combine.
> 
> If I remove the PROT_READ in the user space mmap will it block it?

No. powerpc at least doesn't have write-only mappings.

> Read TLPs are not harmful but I suspect they would cause an
> undesirable random performance anomaly.

I suspect in practice you wont get them esp. if the code has barriers
but ... it's allowed by the architecture.

> > > Does it/could it "combine writes"?
> > 
> > I assume so for ARM, definitely for powerpc.
> 
> Various IBM PPC chips I know work, we do test that.
> 
> > > > That's why I looped you in - that's what worries me about
> > > > "enabling"
> > > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
> > > > regressions that's not OK.
> > > > 
> > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
> > > > driver (or more broadly all drivers following this message push
> > > > semantics) to use "something else" for WC detection.
> > > 
> > > arch_can_pci_mmap_wc() really only controls the sysfs resource file
> > > and it seems very unclear who in userspace uses that these days.
> > 
> > dpdk under some circumstances afaik.
> 
> And something gross for DMA then? Not sure dpdk is useful without
> DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing?

Clint, can you elaborate on the use case ?

> > > vfio is now the right way to do that stuff. I don't see an obvious
> > > way to get WC memory in VFIO though...
> > 
> > Which would be a performance issue on a number of things I suppose...
> 
> Almost nothing uses pci_iomap_wc(), so I'd be surpried if userspace
> DPDK was an important user when an in-kernel driver for the same HW
> doesn't use it?

Hard to know how uses those files out there but I don't like arm not
providing what pretty much all relevant archs do provide since the
semantics afaik aren't that different.

Yes, "write combine" isn't a good name.... The goal is to get WC but it
comes with the whole package on several archs. We don't even have a
reasonnable definition of the semantics of readl/writel on a WC mapping
(hint: on powerpc the barriers in them will prevent WC even on a WC
mapping) nor of what barriers might work  and how on such a mapping.

I tried a while ago and ... ugh.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-11  0:39                         ` Benjamin Herrenschmidt
@ 2020-09-11 14:21                           ` Jason Gunthorpe
  2020-09-11 21:42                           ` Clint Sbisa
  1 sibling, 0 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-11 14:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Fri, Sep 11, 2020 at 10:39:16AM +1000, Benjamin Herrenschmidt wrote:

> Yes, "write combine" isn't a good name.... The goal is to get WC but it
> comes with the whole package on several archs. We don't even have a
> reasonnable definition of the semantics of readl/writel on a WC mapping
> (hint: on powerpc the barriers in them will prevent WC even on a WC
> mapping) nor of what barriers might work  and how on such a mapping.

Yes, you can't really use WC properly in the kernel, we don't have the
infrastructure for it. mlx5 is using __raw_writeq() and wmb() to hack
something ugly together in the kernel.

A useful API for the message method, similar to what we use in
userspace, is something like:

  /*
   * Almost always need a spinlock as multiple CPUs cannot write
   * concurrently.
   */
  spin_lock();

  /*
   * Ensure that all DMA visiable writes in program order are visible
   * to DMA before the WC TLP is sent.
   */
  barrier_wc_after_lock();

  /* Generate the TLP */
  write_wc_message(wc_iomem, message, len);

  /*
   * Writes to wc_iomem past this, by any CPU, cannot replace writes
   * already done in wc_message.
   */
  barrier_wc_before_unlock();

  spin_unlock();

And another varient without the spinlock for stuff that can be per-CPU
for a range of WC memory.

(oh actually I see most drivers are using ioremap_wc(), and there is a
bunch of them including an Amazon ethernet device...)

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-11  0:39                         ` Benjamin Herrenschmidt
  2020-09-11 14:21                           ` Jason Gunthorpe
@ 2020-09-11 21:42                           ` Clint Sbisa
  2020-09-14 14:17                             ` Jason Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Clint Sbisa @ 2020-09-11 21:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Lorenzo Pieralisi, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Fri, Sep 11, 2020 at 10:39:16AM +1000, Benjamin Herrenschmidt wrote:
> > > > > That's why I looped you in - that's what worries me about
> > > > > "enabling"
> > > > > arch_can_pci_mmap_wc() on arm64. If we enable it and we have perf
> > > > > regressions that's not OK.
> > > > >
> > > > > Or we *can* enable arch_can_pci_mmap_wc() but force the mellanox
> > > > > driver (or more broadly all drivers following this message push
> > > > > semantics) to use "something else" for WC detection.
> > > >
> > > > arch_can_pci_mmap_wc() really only controls the sysfs resource file
> > > > and it seems very unclear who in userspace uses that these days.
> > >
> > > dpdk under some circumstances afaik.
> >
> > And something gross for DMA then? Not sure dpdk is useful without
> > DMA. Why not use CONFIG_VFIO_NOIOMMU for such a non-secure thing?
> 
> Clint, can you elaborate on the use case ?
> 

The use-case I'm targeting is the ENA pmd in DPDK. For performance reasons
(many of which are very similar to what Jason has described for mlx5), we need
to generate full-sized TLPs instead of many partial TLPs to improve efficiency.

Here's an excerpt describing the write-combine usage from
./Documentation/networking/device_drivers/ethernet/amazon/ena.rst:

- Low Latency Queue (LLQ) mode or "push-mode".
  * In this mode the driver pushes the transmit descriptors and the
    first 128 bytes of the packet directly to the ENA device memory
    space. The rest of the packet payload is fetched by the
    device. For this operation mode, the driver uses a dedicated PCI
    device memory BAR, which is mapped with write-combine capability.

There's no DMA involved with this BAR-- the driver writes a portion of the
packet contents in addition to the descriptors, which generally increases the
number of TLPs if write-combine isn't used. Furthermore, this BAR is only used
for writes and never for reads.

As Jason noted in the other reply to this email, the Linux ENA driver makes use
of WC by using devm_ioremap_wc(). The DPDK code here uses the same mechanism in
user-space to enable write-combining by mapping the resourceX_wc file if the
driver uses RTE_PCI_DRV_WC_ACTIVATE.

Clint

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-11 21:42                           ` Clint Sbisa
@ 2020-09-14 14:17                             ` Jason Gunthorpe
  2020-09-14 14:24                               ` Clint Sbisa
  2020-09-14 21:41                               ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 14:17 UTC (permalink / raw)
  To: Clint Sbisa
  Cc: Benjamin Herrenschmidt, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas,
	Leon Romanovsky

On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote:

> There's no DMA involved with this BAR-- the driver writes a portion of the
> packet contents in addition to the descriptors, which generally increases the
> number of TLPs if write-combine isn't used. Furthermore, this BAR is only used
> for writes and never for reads.

You use DPDK without DMA? How does receive work?

> As Jason noted in the other reply to this email, the Linux ENA driver makes use
> of WC by using devm_ioremap_wc().

As Ben noted we don't have kernel accessors to make this portable or
safe :(

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 14:17                             ` Jason Gunthorpe
@ 2020-09-14 14:24                               ` Clint Sbisa
  2020-09-14 14:38                                 ` Jason Gunthorpe
  2020-09-14 21:41                               ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Clint Sbisa @ 2020-09-14 14:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas,
	Leon Romanovsky

On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote:
> 
> > There's no DMA involved with this BAR-- the driver writes a portion of the
> > packet contents in addition to the descriptors, which generally increases the
> > number of TLPs if write-combine isn't used. Furthermore, this BAR is only used
> > for writes and never for reads.
> 
> You use DPDK without DMA? How does receive work?

That was not worded well. DMA is used, but the first X bytes of the packet are
written directly to this BAR instead of being DMA'd-- the rest of the data is
DMA'd.

> 
> > As Jason noted in the other reply to this email, the Linux ENA driver makes use
> > of WC by using devm_ioremap_wc().
> 
> As Ben noted we don't have kernel accessors to make this portable or
> safe :(

And therein lies our dilemma...

Clint

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 14:24                               ` Clint Sbisa
@ 2020-09-14 14:38                                 ` Jason Gunthorpe
  2020-09-14 21:42                                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 14:38 UTC (permalink / raw)
  To: Clint Sbisa
  Cc: Benjamin Herrenschmidt, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas,
	Leon Romanovsky

On Mon, Sep 14, 2020 at 02:24:06PM +0000, Clint Sbisa wrote:
> On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote:
> > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote:
> > 
> > > There's no DMA involved with this BAR-- the driver writes a portion of the
> > > packet contents in addition to the descriptors, which generally increases the
> > > number of TLPs if write-combine isn't used. Furthermore, this BAR is only used
> > > for writes and never for reads.
> > 
> > You use DPDK without DMA? How does receive work?
> 
> That was not worded well. DMA is used, but the first X bytes of the packet are
> written directly to this BAR instead of being DMA'd-- the rest of the data is
> DMA'd.

which is back to my original question, how do you do DMA using
/sys/xx/resources? Why not use VFIO like everything else?

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 14:17                             ` Jason Gunthorpe
  2020-09-14 14:24                               ` Clint Sbisa
@ 2020-09-14 21:41                               ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-14 21:41 UTC (permalink / raw)
  To: Jason Gunthorpe, Clint Sbisa
  Cc: Lorenzo Pieralisi, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	will, catalin.marinas, Leon Romanovsky

On Mon, 2020-09-14 at 11:17 -0300, Jason Gunthorpe wrote:
> On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote:
> 
> > There's no DMA involved with this BAR-- the driver writes a portion
> > of the
> > packet contents in addition to the descriptors, which generally
> > increases the
> > number of TLPs if write-combine isn't used. Furthermore, this BAR
> > is only used
> > for writes and never for reads.
> 
> You use DPDK without DMA? How does receive work?
> 
> > As Jason noted in the other reply to this email, the Linux ENA
> > driver makes use
> > of WC by using devm_ioremap_wc().
> 
> As Ben noted we don't have kernel accessors to make this portable or
> safe :(

Well.. to be frank it does work "well enough" for simple cases like
frame buffers :-)

Cheers
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 14:38                                 ` Jason Gunthorpe
@ 2020-09-14 21:42                                   ` Benjamin Herrenschmidt
  2020-09-14 22:00                                     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-14 21:42 UTC (permalink / raw)
  To: Jason Gunthorpe, Clint Sbisa
  Cc: Lorenzo Pieralisi, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	will, catalin.marinas, Leon Romanovsky

On Mon, 2020-09-14 at 11:38 -0300, Jason Gunthorpe wrote:
> On Mon, Sep 14, 2020 at 02:24:06PM +0000, Clint Sbisa wrote:
> > On Mon, Sep 14, 2020 at 11:17:26AM -0300, Jason Gunthorpe wrote:
> > > On Fri, Sep 11, 2020 at 09:42:25PM +0000, Clint Sbisa wrote:
> > > 
> > > > There's no DMA involved with this BAR-- the driver writes a
> > > > portion of the
> > > > packet contents in addition to the descriptors, which generally
> > > > increases the
> > > > number of TLPs if write-combine isn't used. Furthermore, this
> > > > BAR is only used
> > > > for writes and never for reads.
> > > 
> > > You use DPDK without DMA? How does receive work?
> > 
> > That was not worded well. DMA is used, but the first X bytes of the
> > packet are
> > written directly to this BAR instead of being DMA'd-- the rest of
> > the data is
> > DMA'd.
> 
> which is back to my original question, how do you do DMA using
> /sys/xx/resources? Why not use VFIO like everything else?

Note: All this doesnt' change the fact that sys/xx/resources_wc exists
for other archs and I see no reasons so far not to have it on ARM...

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 21:42                                   ` Benjamin Herrenschmidt
@ 2020-09-14 22:00                                     ` Benjamin Herrenschmidt
  2020-09-14 22:32                                       ` Clint Sbisa
  2020-09-14 22:57                                       ` Jason Gunthorpe
  0 siblings, 2 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-14 22:00 UTC (permalink / raw)
  To: Jason Gunthorpe, Clint Sbisa
  Cc: Lorenzo Pieralisi, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	will, catalin.marinas, Leon Romanovsky

On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> 
> > which is back to my original question, how do you do DMA using
> > /sys/xx/resources? Why not use VFIO like everything else?
> 
> Note: All this doesnt' change the fact that sys/xx/resources_wc
> exists
> for other archs and I see no reasons so far not to have it on ARM...

Also... it looks like VFIO also doesn't provide a way to do WC yet
unfortunately :-(

There was a discussion 2 or 3 years ago while I was at IBM about ways
to add that functionality, but it doesn't seem to have resulted in
anything.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 22:00                                     ` Benjamin Herrenschmidt
@ 2020-09-14 22:32                                       ` Clint Sbisa
  2020-09-14 22:57                                       ` Jason Gunthorpe
  1 sibling, 0 replies; 59+ messages in thread
From: Clint Sbisa @ 2020-09-14 22:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Lorenzo Pieralisi, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> >
> > > which is back to my original question, how do you do DMA using
> > > /sys/xx/resources? Why not use VFIO like everything else?
> >
> > Note: All this doesnt' change the fact that sys/xx/resources_wc
> > exists
> > for other archs and I see no reasons so far not to have it on ARM...
> 
> Also... it looks like VFIO also doesn't provide a way to do WC yet
> unfortunately :-(
> 
> There was a discussion 2 or 3 years ago while I was at IBM about ways
> to add that functionality, but it doesn't seem to have resulted in
> anything.

And to complete answering the question, the ENA PMD in DPDK also supports vfio,
which-- as Ben pointed out-- doesn't support WC either.

Clint

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 22:00                                     ` Benjamin Herrenschmidt
  2020-09-14 22:32                                       ` Clint Sbisa
@ 2020-09-14 22:57                                       ` Jason Gunthorpe
  2020-09-14 23:25                                         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-14 22:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Clint Sbisa, Lorenzo Pieralisi, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> > 
> > > which is back to my original question, how do you do DMA using
> > > /sys/xx/resources? Why not use VFIO like everything else?
> > 
> > Note: All this doesnt' change the fact that sys/xx/resources_wc
> > exists
> > for other archs and I see no reasons so far not to have it on ARM...
> 
> Also... it looks like VFIO also doesn't provide a way to do WC yet
> unfortunately :-(

Yes, but if the driving reason for this patch is because a VFIO user
like EFA DPDK is trying to work around VFIO limitations, then I'd say
the VFIO mmap should be amended, and not so much worring about sysfs.

While there is no reason for ARM to not show the sysfs, it really
should never be used. Modern kernels in secure boot don't even show
it, for instance.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 22:57                                       ` Jason Gunthorpe
@ 2020-09-14 23:25                                         ` Benjamin Herrenschmidt
  2020-09-15 10:18                                           ` Lorenzo Pieralisi
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-14 23:25 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Clint Sbisa, Lorenzo Pieralisi, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> > > 
> > > > which is back to my original question, how do you do DMA using
> > > > /sys/xx/resources? Why not use VFIO like everything else?
> > > 
> > > Note: All this doesnt' change the fact that sys/xx/resources_wc
> > > exists
> > > for other archs and I see no reasons so far not to have it on ARM...
> > 
> > Also... it looks like VFIO also doesn't provide a way to do WC yet
> > unfortunately :-(
> 
> Yes, but if the driving reason for this patch is because a VFIO user
> like EFA DPDK is trying to work around VFIO limitations, then I'd say
> the VFIO mmap should be amended, and not so much worring about sysfs.

I don't think the two are exclusive.

> While there is no reason for ARM to not show the sysfs, it really
> should never be used. Modern kernels in secure boot don't even show
> it, for instance.

It's useful for random things, I've used it quite a bit in a previous
life for things like in-lab hw testing etc...  There's tooling out
there, esp. in the more 'embedded' side of thing that uses this, I
don't see a good reason not to provide the same level of functionality.

So Lorenzo, imho, we should merge the patch.

As for fixing VFIO, definitely something to revive. The main contention
point was which "interface" to use to request write combine.

Let's restart that conversation with the appropriate folks, the last I
remember, the question was to figure out what interface to provide
userspace for the functionality.

Clint, do you want to drive this as well ?

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-14 23:25                                         ` Benjamin Herrenschmidt
@ 2020-09-15 10:18                                           ` Lorenzo Pieralisi
  2020-09-15 11:05                                             ` Jason Gunthorpe
  2020-09-15 23:00                                             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-15 10:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, Sep 15, 2020 at 09:25:57AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote:
> > > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> > > > 
> > > > > which is back to my original question, how do you do DMA using
> > > > > /sys/xx/resources? Why not use VFIO like everything else?
> > > > 
> > > > Note: All this doesnt' change the fact that sys/xx/resources_wc
> > > > exists
> > > > for other archs and I see no reasons so far not to have it on ARM...
> > > 
> > > Also... it looks like VFIO also doesn't provide a way to do WC yet
> > > unfortunately :-(
> > 
> > Yes, but if the driving reason for this patch is because a VFIO user
> > like EFA DPDK is trying to work around VFIO limitations, then I'd say
> > the VFIO mmap should be amended, and not so much worring about sysfs.
> 
> I don't think the two are exclusive.
> 
> > While there is no reason for ARM to not show the sysfs, it really
> > should never be used. Modern kernels in secure boot don't even show
> > it, for instance.
> 
> It's useful for random things, I've used it quite a bit in a previous
> life for things like in-lab hw testing etc...  There's tooling out
> there, esp. in the more 'embedded' side of thing that uses this, I
> don't see a good reason not to provide the same level of functionality.
> 
> So Lorenzo, imho, we should merge the patch.

To sum it up:

(1) RDMA drivers need a new mapping function/attribute to define their
    message push model. Actually the message model is not necessarily related
    to write combining a la x86, so we should probably come up with a better
    and consistent naming. Enabling this patchset may trigger performance
    regressions on mellanox drivers on arm64 - this ought to be addressed.
(2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since
    it looks relevant, WC message model semantics should be introduced
    there, somehow.
(3) Given the above, the sysfs interface can be enabled. I still don't
    see the advantages (other than saying it is there for other arches, ie
    what can you really do with the sysfs mappings - see Jason's VFIO/DMA
    query) but we can do it. Still, I am not happy about the possible
    mellanox regressions - that should be tested/verified before this
    patch is merged IMHO. Do we really need it for v5.10 ?

Thoughts ?

Lorenzo

> As for fixing VFIO, definitely something to revive. The main contention
> point was which "interface" to use to request write combine.
> 
> Let's restart that conversation with the appropriate folks, the last I
> remember, the question was to figure out what interface to provide
> userspace for the functionality.
> 
> Clint, do you want to drive this as well ?
> 
> Cheers,
> Ben.
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 10:18                                           ` Lorenzo Pieralisi
@ 2020-09-15 11:05                                             ` Jason Gunthorpe
  2020-09-15 23:17                                               ` Benjamin Herrenschmidt
  2020-09-15 23:00                                             ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 11:05 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, Sep 15, 2020 at 11:18:31AM +0100, Lorenzo Pieralisi wrote:
> On Tue, Sep 15, 2020 at 09:25:57AM +1000, Benjamin Herrenschmidt wrote:
> > On Mon, 2020-09-14 at 19:57 -0300, Jason Gunthorpe wrote:
> > > On Tue, Sep 15, 2020 at 08:00:27AM +1000, Benjamin Herrenschmidt wrote:
> > > > On Tue, 2020-09-15 at 07:42 +1000, Benjamin Herrenschmidt wrote:
> > > > > 
> > > > > > which is back to my original question, how do you do DMA using
> > > > > > /sys/xx/resources? Why not use VFIO like everything else?
> > > > > 
> > > > > Note: All this doesnt' change the fact that sys/xx/resources_wc
> > > > > exists
> > > > > for other archs and I see no reasons so far not to have it on ARM...
> > > > 
> > > > Also... it looks like VFIO also doesn't provide a way to do WC yet
> > > > unfortunately :-(
> > > 
> > > Yes, but if the driving reason for this patch is because a VFIO user
> > > like EFA DPDK is trying to work around VFIO limitations, then I'd say
> > > the VFIO mmap should be amended, and not so much worring about sysfs.
> > 
> > I don't think the two are exclusive.
> > 
> > > While there is no reason for ARM to not show the sysfs, it really
> > > should never be used. Modern kernels in secure boot don't even show
> > > it, for instance.
> > 
> > It's useful for random things, I've used it quite a bit in a previous
> > life for things like in-lab hw testing etc...  There's tooling out
> > there, esp. in the more 'embedded' side of thing that uses this, I
> > don't see a good reason not to provide the same level of functionality.
> > 
> > So Lorenzo, imho, we should merge the patch.
> 
> To sum it up:
> 
> (1) RDMA drivers need a new mapping function/attribute to define their
>     message push model. Actually the message model is not necessarily related
>     to write combining a la x86, so we should probably come up with a better
>     and consistent naming. Enabling this patchset may trigger performance
>     regressions on mellanox drivers on arm64 - this ought to be
>     addressed.

It is pretty clear now that the certain ARM chips that don't do write
combining with pgprot_writecombine will performance regress if they
are running a certain uncommon Mellanox configuration. I suspect these
deployments are all running the out of tree patch for DEVICE_GRE
though.

So, I wouldn't worry about it - but would prefer to see ARM folks
advance some proposal to accommodate those chips that need DEVICE_GRE.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 10:18                                           ` Lorenzo Pieralisi
  2020-09-15 11:05                                             ` Jason Gunthorpe
@ 2020-09-15 23:00                                             ` Benjamin Herrenschmidt
  2020-09-15 23:12                                               ` Clint Sbisa
  1 sibling, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-15 23:00 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Jason Gunthorpe, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, 2020-09-15 at 11:18 +0100, Lorenzo Pieralisi wrote:
> > It's useful for random things, I've used it quite a bit in a previous
> > life for things like in-lab hw testing etc...  There's tooling out
> > there, esp. in the more 'embedded' side of thing that uses this, I
> > don't see a good reason not to provide the same level of functionality.
> > 
> > So Lorenzo, imho, we should merge the patch.
> 
> To sum it up:
> 
> (1) RDMA drivers need a new mapping function/attribute to define their
>     message push model. Actually the message model is not necessarily related
>     to write combining a la x86, so we should probably come up with a better
>     and consistent naming. Enabling this patchset may trigger performance
>     regressions on mellanox drivers on arm64 - this ought to be addressed.

I doubt it. Besides Mellanox will probably enable WC already through
the other code path (the use of this accessor is only one of the path
that enables the driver to do WC).

I don't think we need to solve the RDMA semantics issue that urgently
TBH, and it's definitely an orthogonal issue to that at hand.

> (2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since
>     it looks relevant, WC message model semantics should be introduced
>     there, somehow.

Yes.

> (3) Given the above, the sysfs interface can be enabled. I still don't
>     see the advantages (other than saying it is there for other arches, ie
>     what can you really do with the sysfs mappings - see Jason's VFIO/DMA
>     query) but we can do it. Still, I am not happy about the possible
>     mellanox regressions - that should be tested/verified before this
>     patch is merged IMHO. Do we really need it for v5.10 ?

I don't think there's a significant risk of regression, but then I also
dont' have a way to test :-)

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 23:00                                             ` Benjamin Herrenschmidt
@ 2020-09-15 23:12                                               ` Clint Sbisa
  0 siblings, 0 replies; 59+ messages in thread
From: Clint Sbisa @ 2020-09-15 23:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Lorenzo Pieralisi, Jason Gunthorpe, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, Sep 16, 2020 at 09:00:09AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 11:18 +0100, Lorenzo Pieralisi wrote:
> > To sum it up:
> >
> > (1) RDMA drivers need a new mapping function/attribute to define their
> >     message push model. Actually the message model is not necessarily related
> >     to write combining a la x86, so we should probably come up with a better
> >     and consistent naming. Enabling this patchset may trigger performance
> >     regressions on mellanox drivers on arm64 - this ought to be addressed.
> 
> I doubt it. Besides Mellanox will probably enable WC already through
> the other code path (the use of this accessor is only one of the path
> that enables the driver to do WC).
> 
> I don't think we need to solve the RDMA semantics issue that urgently
> TBH, and it's definitely an orthogonal issue to that at hand.
> 
> > (2) User-space/passthrough drivers rely on VFIO for BAR mappings. Since
> >     it looks relevant, WC message model semantics should be introduced
> >     there, somehow.
> 
> Yes.

I will ping some folks on the VFIO patch to see if we can get the ball rolling
there again.

> 
> > (3) Given the above, the sysfs interface can be enabled. I still don't
> >     see the advantages (other than saying it is there for other arches, ie
> >     what can you really do with the sysfs mappings - see Jason's VFIO/DMA
> >     query) but we can do it. Still, I am not happy about the possible
> >     mellanox regressions - that should be tested/verified before this
> >     patch is merged IMHO. Do we really need it for v5.10 ?
> 
> I don't think there's a significant risk of regression, but then I also
> dont' have a way to test :-)

I'm going to re-submit this patch to Catalin and Will with an updated commit
message capturing the context from this discussion (and cc everyone involved).

As for the whole device GRE / new naming context-- I'll have to defer to
Lorenzo on suggested next steps on this front.

Thanks,
Clint

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 11:05                                             ` Jason Gunthorpe
@ 2020-09-15 23:17                                               ` Benjamin Herrenschmidt
  2020-09-15 23:40                                                 ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-15 23:17 UTC (permalink / raw)
  To: Jason Gunthorpe, Lorenzo Pieralisi
  Cc: Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel, will,
	catalin.marinas, Leon Romanovsky

On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote:
> > To sum it up:
> > 
> > (1) RDMA drivers need a new mapping function/attribute to define their
> >      message push model. Actually the message model is not necessarily related
> >      to write combining a la x86, so we should probably come up with a better
> >      and consistent naming. Enabling this patchset may trigger performance
> >      regressions on mellanox drivers on arm64 - this ought to be
> >      addressed.
> 
> It is pretty clear now that the certain ARM chips that don't do write
> combining with pgprot_writecombine will performance regress if they
> are running a certain uncommon Mellanox configuration. I suspect these
> deployments are all running the out of tree patch for DEVICE_GRE
> though.

I'm not sure I understand...

Today those ARM chips will not use pgprot_writecombine (at least not
using that code path, they might still use it as the result of the
other path in the driver that can enable it). So they get
MT_DEVICE_nGnRnE (unless I missed something here).

So they will not combine.

With the patch, those device will now use MT_DEVICE_NC.

Why would that be a regression ? It will allow speculation, that
doesn't necessarily mean that the CPU will cause spurrious accesses, it
probably won't in most case... And it should allow combining, no ?

BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ?
Its not supported on some chips ?

Not that this lead me to discover annother weird thing ...

What on earth is pgprot_device() ? This is new ? On ARM it will be
MT_DEVICE_nGnRE, so it allows posted write. It seems to match what
ioremap does. Should then ioremap use it as well ?

But it's only ever used for PCI mmap. Why is it different from
pgprot_noncached() which disables posted writes (nE) ?

Because a whole lot of drivers will use pgprot_noncached() explicitly
in either mmap or vmap, with the expectation that it's somewhat the
same as what ioremap does...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 23:17                                               ` Benjamin Herrenschmidt
@ 2020-09-15 23:40                                                 ` Jason Gunthorpe
  2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
  2020-09-16  8:33                                                   ` Will Deacon
  0 siblings, 2 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-15 23:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote:
> > > To sum it up:
> > > 
> > > (1) RDMA drivers need a new mapping function/attribute to define their
> > >      message push model. Actually the message model is not necessarily related
> > >      to write combining a la x86, so we should probably come up with a better
> > >      and consistent naming. Enabling this patchset may trigger performance
> > >      regressions on mellanox drivers on arm64 - this ought to be
> > >      addressed.
> > 
> > It is pretty clear now that the certain ARM chips that don't do write
> > combining with pgprot_writecombine will performance regress if they
> > are running a certain uncommon Mellanox configuration. I suspect these
> > deployments are all running the out of tree patch for DEVICE_GRE
> > though.
> 
> I'm not sure I understand...
> 
> Today those ARM chips will not use pgprot_writecombine (at least not
> using that code path, they might still use it as the result of the
> other path in the driver that can enable it). 

Not quite, upstream kernel will never use WC on those
devices. DEVICE_GRE is not supported in upstream,
arch_can_pci_mmap_wc() is always false and the WC tester will always
fail.

> With the patch, those device will now use MT_DEVICE_NC.

Which doesn't do WC at all on some ARM implementations.
 
> Why would that be a regression ? 

Using the WC submission flow when it doesn't work costs something like
10% performance vs using the non-WC flow.

Like I said, the case where the driver can't self test probably
doesn't intersect with the ARM implementations that can't do write
combining, and if it did, the users probably run the out of tree
driver that has the hacky stuff to make it use DEVICE_GRE.

> BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ?
> Its not supported on some chips ?

It has alignment requirements drivers don't meet. We need a new
concept of "write combining and I promise to do aligned access"

> What on earth is pgprot_device() ? This is new ? On ARM it will be
> MT_DEVICE_nGnRE, so it allows posted write. It seems to match what
> ioremap does. Should then ioremap use it as well ?
> 
> But it's only ever used for PCI mmap. Why is it different from
> pgprot_noncached() which disables posted writes (nE) ?
> 
> Because a whole lot of drivers will use pgprot_noncached() explicitly
> in either mmap or vmap, with the expectation that it's somewhat the
> same as what ioremap does...

*boggle*

Only sysfs uses pci_mmap_resource_range() any other driver exposing
BAR pages, like VFIO dies not. Makes no sense at all it is different.

Delete the ill defined pgprot_device() ? Nobody has complained
something is wrong with VFIO in the 6 years since it was added...

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 23:40                                                 ` Jason Gunthorpe
@ 2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
  2020-09-16 12:12                                                     ` Jason Gunthorpe
  2020-09-16 12:48                                                     ` Leon Romanovsky
  2020-09-16  8:33                                                   ` Will Deacon
  1 sibling, 2 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-16  7:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote:
> Not quite, upstream kernel will never use WC on those
> devices. DEVICE_GRE is not supported in upstream,
> arch_can_pci_mmap_wc() is always false and the WC tester will always
> fail.
> 
> > With the patch, those device will now use MT_DEVICE_NC.
> 
> Which doesn't do WC at all on some ARM implementations.

Lovely... this is arm64 btw, still the case ?

Also we could make this a variable rather than a constant and choose
a more appropriate set of flags at boot time....

> > Why would that be a regression ? 
> 
> Using the WC submission flow when it doesn't work costs something like
> 10% performance vs using the non-WC flow.

You mean the driver uses a different path to the HW which ahs that
overhead, not that MMIOs have that overhead right ?

> Like I said, the case where the driver can't self test probably
> doesn't intersect with the ARM implementations that can't do write
> combining, and if it did, the users probably run the out of tree
> driver that has the hacky stuff to make it use DEVICE_GRE.

Ok. So you are saying to go for it and ignore that Mellanox case then ?
:-)

> > BTW. Lorenzo, why don't we use MT_DEVICE_GRE for pgprot_writecombine ?
> > Its not supported on some chips ?
> 
> It has alignment requirements drivers don't meet. We need a new
> concept of "write combining and I promise to do aligned access"

Ah yes, I remember. Right, we would need to provide new/better
accessors for these kind of things. It's going to be a mess to find a
common set that works for all archs.

> > What on earth is pgprot_device() ? This is new ? On ARM it will be
> > MT_DEVICE_nGnRE, so it allows posted write. It seems to match what
> > ioremap does. Should then ioremap use it as well ?
> > 
> > But it's only ever used for PCI mmap. Why is it different from
> > pgprot_noncached() which disables posted writes (nE) ?
> > 
> > Because a whole lot of drivers will use pgprot_noncached() explicitly
> > in either mmap or vmap, with the expectation that it's somewhat the
> > same as what ioremap does...
> 
> *boggle*
> 
> Only sysfs uses pci_mmap_resource_range() any other driver exposing
> BAR pages, like VFIO dies not. Makes no sense at all it is different.
> 
> Delete the ill defined pgprot_device() ? Nobody has complained
> something is wrong with VFIO in the 6 years since it was added...

I was wondering what it was, that's it ... 

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-15 23:40                                                 ` Jason Gunthorpe
  2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
@ 2020-09-16  8:33                                                   ` Will Deacon
  2020-09-16  8:48                                                     ` Catalin Marinas
  2020-09-16 12:08                                                     ` Jason Gunthorpe
  1 sibling, 2 replies; 59+ messages in thread
From: Will Deacon @ 2020-09-16  8:33 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Lorenzo Pieralisi, Clint Sbisa,
	linux-pci, Bjorn Helgaas, linux-arm-kernel, catalin.marinas,
	Leon Romanovsky

On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote:
> On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote:
> > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > To sum it up:
> > > > 
> > > > (1) RDMA drivers need a new mapping function/attribute to define their
> > > >      message push model. Actually the message model is not necessarily related
> > > >      to write combining a la x86, so we should probably come up with a better
> > > >      and consistent naming. Enabling this patchset may trigger performance
> > > >      regressions on mellanox drivers on arm64 - this ought to be
> > > >      addressed.
> > > 
> > > It is pretty clear now that the certain ARM chips that don't do write
> > > combining with pgprot_writecombine will performance regress if they
> > > are running a certain uncommon Mellanox configuration. I suspect these
> > > deployments are all running the out of tree patch for DEVICE_GRE
> > > though.
> > 
> > I'm not sure I understand...
> > 
> > Today those ARM chips will not use pgprot_writecombine (at least not
> > using that code path, they might still use it as the result of the
> > other path in the driver that can enable it). 
> 
> Not quite, upstream kernel will never use WC on those
> devices. DEVICE_GRE is not supported in upstream,
> arch_can_pci_mmap_wc() is always false and the WC tester will always
> fail.
> 
> > With the patch, those device will now use MT_DEVICE_NC.
> 
> Which doesn't do WC at all on some ARM implementations.

Is that just TX2? I remember that thing being weird where GRE performed
better than NC, but I thought that was a one off (and the thing is dead).

NC is more permissive than GRE, so I think that's the right one to use; i.e.
we go for the fewest number of restrictions on the hardware. If somebody
screws up the uarch, that's up to them.

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16  8:33                                                   ` Will Deacon
@ 2020-09-16  8:48                                                     ` Catalin Marinas
  2020-09-16 14:15                                                       ` Lorenzo Pieralisi
  2020-09-16 12:08                                                     ` Jason Gunthorpe
  1 sibling, 1 reply; 59+ messages in thread
From: Catalin Marinas @ 2020-09-16  8:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Lorenzo Pieralisi,
	Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	Leon Romanovsky

On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote:
> On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote:
> > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote:
> > > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > > To sum it up:
> > > > > 
> > > > > (1) RDMA drivers need a new mapping function/attribute to define their
> > > > >      message push model. Actually the message model is not necessarily related
> > > > >      to write combining a la x86, so we should probably come up with a better
> > > > >      and consistent naming. Enabling this patchset may trigger performance
> > > > >      regressions on mellanox drivers on arm64 - this ought to be
> > > > >      addressed.
> > > > 
> > > > It is pretty clear now that the certain ARM chips that don't do write
> > > > combining with pgprot_writecombine will performance regress if they
> > > > are running a certain uncommon Mellanox configuration. I suspect these
> > > > deployments are all running the out of tree patch for DEVICE_GRE
> > > > though.
> > > 
> > > I'm not sure I understand...
> > > 
> > > Today those ARM chips will not use pgprot_writecombine (at least not
> > > using that code path, they might still use it as the result of the
> > > other path in the driver that can enable it). 
> > 
> > Not quite, upstream kernel will never use WC on those
> > devices. DEVICE_GRE is not supported in upstream,
> > arch_can_pci_mmap_wc() is always false and the WC tester will always
> > fail.
> > 
> > > With the patch, those device will now use MT_DEVICE_NC.
> > 
> > Which doesn't do WC at all on some ARM implementations.
> 
> Is that just TX2? I remember that thing being weird where GRE performed
> better than NC, but I thought that was a one off (and the thing is dead).

I recall something along these lines. Hopefully ARM updated the guidance
to licensees.

> NC is more permissive than GRE, so I think that's the right one to use; i.e.
> we go for the fewest number of restrictions on the hardware. If somebody
> screws up the uarch, that's up to them.

I agree, Normal NC is better as long as the BAR can tolerate read
side-effects.

-- 
Catalin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16  8:33                                                   ` Will Deacon
  2020-09-16  8:48                                                     ` Catalin Marinas
@ 2020-09-16 12:08                                                     ` Jason Gunthorpe
  1 sibling, 0 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 12:08 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Lorenzo Pieralisi, Clint Sbisa,
	linux-pci, Bjorn Helgaas, linux-arm-kernel, catalin.marinas,
	Leon Romanovsky

On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote:

> Is that just TX2? I remember that thing being weird where GRE performed
> better than NC, but I thought that was a one off (and the thing is dead).

It is at least TX2, yes, but it is not really dead, someone made a
respectable supercomputer out of them that still has an operational
life ahead of it.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
@ 2020-09-16 12:12                                                     ` Jason Gunthorpe
  2020-09-16 14:09                                                       ` Lorenzo Pieralisi
  2020-09-16 23:59                                                       ` Benjamin Herrenschmidt
  2020-09-16 12:48                                                     ` Leon Romanovsky
  1 sibling, 2 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 12:12 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, Sep 16, 2020 at 05:59:24PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote:
> > Not quite, upstream kernel will never use WC on those
> > devices. DEVICE_GRE is not supported in upstream,
> > arch_can_pci_mmap_wc() is always false and the WC tester will always
> > fail.
> > 
> > > With the patch, those device will now use MT_DEVICE_NC.
> > 
> > Which doesn't do WC at all on some ARM implementations.
> 
> Lovely... this is arm64 btw, still the case ?

Yep

> Also we could make this a variable rather than a constant and choose
> a more appropriate set of flags at boot time....

It is a function, so it could check the CPU ID for the known broken
devices and block them.
 
> > > Why would that be a regression ? 
> > 
> > Using the WC submission flow when it doesn't work costs something like
> > 10% performance vs using the non-WC flow.
> 
> You mean the driver uses a different path to the HW which ahs that
> overhead, not that MMIOs have that overhead right ?

The different path has overhead of doing extra useless MMIOs because
they don't combine

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
  2020-09-16 12:12                                                     ` Jason Gunthorpe
@ 2020-09-16 12:48                                                     ` Leon Romanovsky
  1 sibling, 0 replies; 59+ messages in thread
From: Leon Romanovsky @ 2020-09-16 12:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Lorenzo Pieralisi, Clint Sbisa, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, will, catalin.marinas

On Wed, Sep 16, 2020 at 05:59:24PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2020-09-15 at 20:40 -0300, Jason Gunthorpe wrote:
>
> > Like I said, the case where the driver can't self test probably
> > doesn't intersect with the ARM implementations that can't do write
> > combining, and if it did, the users probably run the out of tree
> > driver that has the hacky stuff to make it use DEVICE_GRE.
>
> Ok. So you are saying to go for it and ignore that Mellanox case then ?
> :-)

Regarding Mellanox, as I wrote it in the beginning of this thread, you
can ignore mlx5 driver.
https://lore.kernel.org/linux-pci/20200910105419.GH421756@unreal/

Thanks

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 12:12                                                     ` Jason Gunthorpe
@ 2020-09-16 14:09                                                       ` Lorenzo Pieralisi
  2020-09-16 14:14                                                         ` Jason Gunthorpe
  2020-09-16 23:59                                                       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-16 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, Sep 16, 2020 at 09:12:26AM -0300, Jason Gunthorpe wrote:

[...]

> > You mean the driver uses a different path to the HW which ahs that
> > overhead, not that MMIOs have that overhead right ?
> 
> The different path has overhead of doing extra useless MMIOs because
> they don't combine

For my own information, this is IB user space driver code, correct ? It
tries to mmap buffer as WC and if it succeeds write into it in an
optimized fashion (that is just pure overhead on platforms where normal
NC memory - ie WC on arm64 - does not do what the _architecture_ defines
it should).

Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 14:09                                                       ` Lorenzo Pieralisi
@ 2020-09-16 14:14                                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-16 14:14 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, Sep 16, 2020 at 03:09:14PM +0100, Lorenzo Pieralisi wrote:
> On Wed, Sep 16, 2020 at 09:12:26AM -0300, Jason Gunthorpe wrote:
> 
> [...]
> 
> > > You mean the driver uses a different path to the HW which ahs that
> > > overhead, not that MMIOs have that overhead right ?
> > 
> > The different path has overhead of doing extra useless MMIOs because
> > they don't combine
> 
> For my own information, this is IB user space driver code, correct ?

Yes, maybe DPDK too

> It tries to mmap buffer as WC and if it succeeds write into it in an
> optimized fashion (that is just pure overhead on platforms where
> normal NC memory - ie WC on arm64 - does not do what the
> _architecture_ defines it should).

Right, pure overhead if large PCI-E TLPs are not delivered to the
device.

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16  8:48                                                     ` Catalin Marinas
@ 2020-09-16 14:15                                                       ` Lorenzo Pieralisi
  2020-09-16 17:00                                                         ` Catalin Marinas
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-16 14:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Jason Gunthorpe, Benjamin Herrenschmidt,
	Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	Leon Romanovsky

On Wed, Sep 16, 2020 at 09:48:52AM +0100, Catalin Marinas wrote:
> On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote:
> > On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote:
> > > > On Tue, 2020-09-15 at 08:05 -0300, Jason Gunthorpe wrote:
> > > > > > To sum it up:
> > > > > > 
> > > > > > (1) RDMA drivers need a new mapping function/attribute to define their
> > > > > >      message push model. Actually the message model is not necessarily related
> > > > > >      to write combining a la x86, so we should probably come up with a better
> > > > > >      and consistent naming. Enabling this patchset may trigger performance
> > > > > >      regressions on mellanox drivers on arm64 - this ought to be
> > > > > >      addressed.
> > > > > 
> > > > > It is pretty clear now that the certain ARM chips that don't do write
> > > > > combining with pgprot_writecombine will performance regress if they
> > > > > are running a certain uncommon Mellanox configuration. I suspect these
> > > > > deployments are all running the out of tree patch for DEVICE_GRE
> > > > > though.
> > > > 
> > > > I'm not sure I understand...
> > > > 
> > > > Today those ARM chips will not use pgprot_writecombine (at least not
> > > > using that code path, they might still use it as the result of the
> > > > other path in the driver that can enable it). 
> > > 
> > > Not quite, upstream kernel will never use WC on those
> > > devices. DEVICE_GRE is not supported in upstream,
> > > arch_can_pci_mmap_wc() is always false and the WC tester will always
> > > fail.
> > > 
> > > > With the patch, those device will now use MT_DEVICE_NC.
> > > 
> > > Which doesn't do WC at all on some ARM implementations.
> > 
> > Is that just TX2? I remember that thing being weird where GRE performed
> > better than NC, but I thought that was a one off (and the thing is dead).
> 
> I recall something along these lines. Hopefully ARM updated the guidance
> to licensees.
> 
> > NC is more permissive than GRE, so I think that's the right one to use; i.e.
> > we go for the fewest number of restrictions on the hardware. If somebody
> > screws up the uarch, that's up to them.
> 
> I agree, Normal NC is better as long as the BAR can tolerate read
> side-effects.

That we don't know but if a prefetchable BAR can't tolerate read
side effects this would be already a problem on eg x86 - that's
the best we can hope for given the current PCI specs.

+1 on normal NC. The only open point is whether we should make
arch_can_pci_mmap_wc() return false on platforms like TX2.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 14:15                                                       ` Lorenzo Pieralisi
@ 2020-09-16 17:00                                                         ` Catalin Marinas
  2020-09-16 21:29                                                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 59+ messages in thread
From: Catalin Marinas @ 2020-09-16 17:00 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Will Deacon, Jason Gunthorpe, Benjamin Herrenschmidt,
	Clint Sbisa, linux-pci, Bjorn Helgaas, linux-arm-kernel,
	Leon Romanovsky

On Wed, Sep 16, 2020 at 03:15:02PM +0100, Lorenzo Pieralisi wrote:
> On Wed, Sep 16, 2020 at 09:48:52AM +0100, Catalin Marinas wrote:
> > On Wed, Sep 16, 2020 at 09:33:16AM +0100, Will Deacon wrote:
> > > On Tue, Sep 15, 2020 at 08:40:06PM -0300, Jason Gunthorpe wrote:
> > > > On Wed, Sep 16, 2020 at 09:17:38AM +1000, Benjamin Herrenschmidt wrote:
> > > > > With the patch, those device will now use MT_DEVICE_NC.
> > > > 
> > > > Which doesn't do WC at all on some ARM implementations.
> > > 
> > > Is that just TX2? I remember that thing being weird where GRE performed
> > > better than NC, but I thought that was a one off (and the thing is dead).
> > 
> > I recall something along these lines. Hopefully ARM updated the guidance
> > to licensees.
> > 
> > > NC is more permissive than GRE, so I think that's the right one to use; i.e.
> > > we go for the fewest number of restrictions on the hardware. If somebody
> > > screws up the uarch, that's up to them.
> > 
> > I agree, Normal NC is better as long as the BAR can tolerate read
> > side-effects.
> 
> That we don't know but if a prefetchable BAR can't tolerate read
> side effects this would be already a problem on eg x86 - that's
> the best we can hope for given the current PCI specs.
> 
> +1 on normal NC. The only open point is whether we should make
> arch_can_pci_mmap_wc() return false on platforms like TX2.

I lost track in this thread whether it matters. TX2 would need Device
GRE for optimal performance but the kernel doesn't currently provide it
anyway. We could expose a new memory type, aligned_wc ;).

-- 
Catalin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 17:00                                                         ` Catalin Marinas
@ 2020-09-16 21:29                                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-16 21:29 UTC (permalink / raw)
  To: Catalin Marinas, Lorenzo Pieralisi
  Cc: Will Deacon, Jason Gunthorpe, Clint Sbisa, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, Leon Romanovsky

On Wed, 2020-09-16 at 18:00 +0100, Catalin Marinas wrote:
> > That we don't know but if a prefetchable BAR can't tolerate read
> > side effects this would be already a problem on eg x86 - that's
> > the best we can hope for given the current PCI specs.
> > 
> > +1 on normal NC. The only open point is whether we should make
> > arch_can_pci_mmap_wc() return false on platforms like TX2.
> 
> I lost track in this thread whether it matters. TX2 would need Device
> GRE for optimal performance but the kernel doesn't currently provide
> it anyway. We could expose a new memory type, aligned_wc ;).

Or ignore TX2 :-) Though Lorenzo has a point about making it return
false for arch_can_pci_mmap_wc() if we really care enough.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 12:12                                                     ` Jason Gunthorpe
  2020-09-16 14:09                                                       ` Lorenzo Pieralisi
@ 2020-09-16 23:59                                                       ` Benjamin Herrenschmidt
  2020-09-17 10:28                                                         ` Lorenzo Pieralisi
  1 sibling, 1 reply; 59+ messages in thread
From: Benjamin Herrenschmidt @ 2020-09-16 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Lorenzo Pieralisi, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Wed, 2020-09-16 at 09:12 -0300, Jason Gunthorpe wrote:
> > Also we could make this a variable rather than a constant and
> > choose
> > a more appropriate set of flags at boot time....
> 
> It is a function, so it could check the CPU ID for the known broken
> devices and block them.

Sure, I meant in the abstract way. It's not a hot path so it doesnt
have to be a static key.

> > > > Why would that be a regression ? 
> > > 
> > > Using the WC submission flow when it doesn't work costs something
> > > like
> > > 10% performance vs using the non-WC flow.
> > 
> > You mean the driver uses a different path to the HW which ahs that
> > overhead, not that MMIOs have that overhead right ?
> 
> The different path has overhead of doing extra useless MMIOs because
> they don't combine

I see. This might have to end up being a TX2 specific hack until the
end of times...

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-16 23:59                                                       ` Benjamin Herrenschmidt
@ 2020-09-17 10:28                                                         ` Lorenzo Pieralisi
  2020-09-17 11:32                                                           ` Jason Gunthorpe
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-17 10:28 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 17, 2020 at 09:59:28AM +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2020-09-16 at 09:12 -0300, Jason Gunthorpe wrote:
> > > Also we could make this a variable rather than a constant and
> > > choose
> > > a more appropriate set of flags at boot time....
> > 
> > It is a function, so it could check the CPU ID for the known broken
> > devices and block them.
> 
> Sure, I meant in the abstract way. It's not a hot path so it doesnt
> have to be a static key.
> 
> > > > > Why would that be a regression ? 
> > > > 
> > > > Using the WC submission flow when it doesn't work costs something
> > > > like
> > > > 10% performance vs using the non-WC flow.
> > > 
> > > You mean the driver uses a different path to the HW which ahs that
> > > overhead, not that MMIOs have that overhead right ?
> > 
> > The different path has overhead of doing extra useless MMIOs because
> > they don't combine
> 
> I see. This might have to end up being a TX2 specific hack until the
> end of times...

True - hopefully on platforms that implement normal NC the architectural
way will not trigger user space performance regressions.

Unfortunately if we merge this patch we _do_ know from this thread
that userspace will suffer from a perf regression on TX2.

Either we ignore it or we write some code to prevent it
(ie first step make arch_can_pci_mmap_wc() return 0 on TX2 -
possibly using the arm64 errata detection mechanism).

Adding a new IO mapping API and use it in IB drivers won't solve the TX2
problem - since we still prefer normal NC over device GRE for "WC"
mappings and we would have to "downgrade" TX2 somehow.

Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-17 10:28                                                         ` Lorenzo Pieralisi
@ 2020-09-17 11:32                                                           ` Jason Gunthorpe
  2020-09-17 14:01                                                             ` Lorenzo Pieralisi
  0 siblings, 1 reply; 59+ messages in thread
From: Jason Gunthorpe @ 2020-09-17 11:32 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote:
> Unfortunately if we merge this patch we _do_ know from this thread
> that userspace will suffer from a perf regression on TX2.
> 
> Either we ignore it or we write some code to prevent it
> (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 -
> possibly using the arm64 errata detection mechanism).

If anyone complains they can send the change to arch_can_pci_mmap_wc()
- I'm pretty sure nobody will complain due to mlx5

Jason

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-17 11:32                                                           ` Jason Gunthorpe
@ 2020-09-17 14:01                                                             ` Lorenzo Pieralisi
  2020-09-17 16:08                                                               ` Will Deacon
  0 siblings, 1 reply; 59+ messages in thread
From: Lorenzo Pieralisi @ 2020-09-17 14:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Clint Sbisa, linux-pci, Bjorn Helgaas,
	linux-arm-kernel, will, catalin.marinas, Leon Romanovsky

On Thu, Sep 17, 2020 at 08:32:21AM -0300, Jason Gunthorpe wrote:
> On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote:
> > Unfortunately if we merge this patch we _do_ know from this thread
> > that userspace will suffer from a perf regression on TX2.
> > 
> > Either we ignore it or we write some code to prevent it
> > (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 -
> > possibly using the arm64 errata detection mechanism).
> 
> If anyone complains they can send the change to arch_can_pci_mmap_wc()
> - I'm pretty sure nobody will complain due to mlx5

If that's the case we can go ahead and merge this patch with a
reworded commit log - this thread was very useful nonetheless
for my own information (and others hopefully) so thank you.

For VFIO WC + "message push ioremap" - to be continued.

Clint, please reword the commit and resend, not sure we can hit
v5.10 but we shall try.

Thanks,
Lorenzo

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: [PATCH] arm64: Enable PCI write-combine resources under sysfs
  2020-09-17 14:01                                                             ` Lorenzo Pieralisi
@ 2020-09-17 16:08                                                               ` Will Deacon
  0 siblings, 0 replies; 59+ messages in thread
From: Will Deacon @ 2020-09-17 16:08 UTC (permalink / raw)
  To: Lorenzo Pieralisi
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Clint Sbisa, linux-pci,
	Bjorn Helgaas, linux-arm-kernel, catalin.marinas,
	Leon Romanovsky

On Thu, Sep 17, 2020 at 03:01:16PM +0100, Lorenzo Pieralisi wrote:
> On Thu, Sep 17, 2020 at 08:32:21AM -0300, Jason Gunthorpe wrote:
> > On Thu, Sep 17, 2020 at 11:28:19AM +0100, Lorenzo Pieralisi wrote:
> > > Unfortunately if we merge this patch we _do_ know from this thread
> > > that userspace will suffer from a perf regression on TX2.
> > > 
> > > Either we ignore it or we write some code to prevent it
> > > (ie first step make arch_can_pci_mmap_wc() return 0 on TX2 -
> > > possibly using the arm64 errata detection mechanism).
> > 
> > If anyone complains they can send the change to arch_can_pci_mmap_wc()
> > - I'm pretty sure nobody will complain due to mlx5
> 
> If that's the case we can go ahead and merge this patch with a
> reworded commit log - this thread was very useful nonetheless
> for my own information (and others hopefully) so thank you.
> 
> For VFIO WC + "message push ioremap" - to be continued.
> 
> Clint, please reword the commit and resend, not sure we can hit
> v5.10 but we shall try.

If you ack it, I'll queue it ;)

Will

^ permalink raw reply	[flat|nested] 59+ messages in thread

end of thread, other threads:[~2020-09-17 19:42 UTC | newest]

Thread overview: 59+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-31 15:18 [PATCH] arm64: Enable PCI write-combine resources under sysfs Clint Sbisa
2020-08-31 23:24 ` Benjamin Herrenschmidt
2020-09-01 18:37 ` Bjorn Helgaas
2020-09-01 23:22   ` Benjamin Herrenschmidt
2020-09-02  8:57     ` Will Deacon
2020-09-02 11:32 ` Lorenzo Pieralisi
2020-09-02 14:29   ` Clint Sbisa
2020-09-02 16:47     ` Lorenzo Pieralisi
2020-09-02 17:21       ` Catalin Marinas
2020-09-02 17:54         ` Lorenzo Pieralisi
2020-09-02 23:03           ` Benjamin Herrenschmidt
2020-09-02 23:08         ` Benjamin Herrenschmidt
2020-09-02 23:08           ` Benjamin Herrenschmidt
2020-09-02 23:07       ` Benjamin Herrenschmidt
2020-09-03 11:08         ` Lorenzo Pieralisi
2020-09-03 14:36           ` Clint Sbisa
2020-09-03 22:26           ` Benjamin Herrenschmidt
2020-09-07 23:33           ` Benjamin Herrenschmidt
2020-09-10  9:46             ` Lorenzo Pieralisi
2020-09-10 10:54               ` Leon Romanovsky
2020-09-10 12:37               ` Jason Gunthorpe
2020-09-10 15:17                 ` Lorenzo Pieralisi
2020-09-10 17:10                   ` Jason Gunthorpe
2020-09-10 21:46                     ` Benjamin Herrenschmidt
2020-09-10 23:29                       ` Jason Gunthorpe
2020-09-11  0:39                         ` Benjamin Herrenschmidt
2020-09-11 14:21                           ` Jason Gunthorpe
2020-09-11 21:42                           ` Clint Sbisa
2020-09-14 14:17                             ` Jason Gunthorpe
2020-09-14 14:24                               ` Clint Sbisa
2020-09-14 14:38                                 ` Jason Gunthorpe
2020-09-14 21:42                                   ` Benjamin Herrenschmidt
2020-09-14 22:00                                     ` Benjamin Herrenschmidt
2020-09-14 22:32                                       ` Clint Sbisa
2020-09-14 22:57                                       ` Jason Gunthorpe
2020-09-14 23:25                                         ` Benjamin Herrenschmidt
2020-09-15 10:18                                           ` Lorenzo Pieralisi
2020-09-15 11:05                                             ` Jason Gunthorpe
2020-09-15 23:17                                               ` Benjamin Herrenschmidt
2020-09-15 23:40                                                 ` Jason Gunthorpe
2020-09-16  7:59                                                   ` Benjamin Herrenschmidt
2020-09-16 12:12                                                     ` Jason Gunthorpe
2020-09-16 14:09                                                       ` Lorenzo Pieralisi
2020-09-16 14:14                                                         ` Jason Gunthorpe
2020-09-16 23:59                                                       ` Benjamin Herrenschmidt
2020-09-17 10:28                                                         ` Lorenzo Pieralisi
2020-09-17 11:32                                                           ` Jason Gunthorpe
2020-09-17 14:01                                                             ` Lorenzo Pieralisi
2020-09-17 16:08                                                               ` Will Deacon
2020-09-16 12:48                                                     ` Leon Romanovsky
2020-09-16  8:33                                                   ` Will Deacon
2020-09-16  8:48                                                     ` Catalin Marinas
2020-09-16 14:15                                                       ` Lorenzo Pieralisi
2020-09-16 17:00                                                         ` Catalin Marinas
2020-09-16 21:29                                                           ` Benjamin Herrenschmidt
2020-09-16 12:08                                                     ` Jason Gunthorpe
2020-09-15 23:00                                             ` Benjamin Herrenschmidt
2020-09-15 23:12                                               ` Clint Sbisa
2020-09-14 21:41                               ` Benjamin Herrenschmidt

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).