All of lore.kernel.org
 help / color / mirror / Atom feed
* RFC on writel and writel_relaxed
@ 2018-03-21  3:07 Sinan Kaya
  2018-03-21  3:40   ` Oliver
  0 siblings, 1 reply; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21  3:07 UTC (permalink / raw)
  To: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

Hi PPC Maintainers,

We are seeking feedback on the status of relaxed write API implementation.
What is the motivation for not implementing the relaxed API? 

I see that network drivers are working around the issue by calling
__raw_write() API directly but this also breaks other architectures
like SPARC since the semantics of __raw_writel() seems to be system dependent.

This is putting drivers into a tight position and they cannot achieve true
multi-arch enablement and are forced into calling __raw APIs flavors
directly with #ifdef BIG_ENDIAN ugliness.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21  3:07 RFC on writel and writel_relaxed Sinan Kaya
@ 2018-03-21  3:40   ` Oliver
  0 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-21  3:40 UTC (permalink / raw)
  To: Sinan Kaya; +Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, Mar 21, 2018 at 2:07 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Hi PPC Maintainers,
>
> We are seeking feedback on the status of relaxed write API implementation.
> What is the motivation for not implementing the relaxed API?

Hmm, good question.

Looks like we've implemented the relaxed_* variants by aliasing them
to the normal version since the dawn of time. There's a comment in
io.h saying something about us not having the expected semantics for
the relaxed variants, but I don't see what the issue is... Ben?

> I see that network drivers are working around the issue by calling
> __raw_write() API directly but this also breaks other architectures
> like SPARC since the semantics of __raw_writel() seems to be system dependent.

Yeah that's pretty gross. Which drivers are doing this?

> This is putting drivers into a tight position and they cannot achieve true
> multi-arch enablement and are forced into calling __raw APIs flavors
> directly with #ifdef BIG_ENDIAN ugliness.
>
> Sinan
>
> --
> Sinan Kaya
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-21  3:40   ` Oliver
  0 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-21  3:40 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Benjamin Herrenschmidt

On Wed, Mar 21, 2018 at 2:07 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Hi PPC Maintainers,
>
> We are seeking feedback on the status of relaxed write API implementation.
> What is the motivation for not implementing the relaxed API?

Hmm, good question.

Looks like we've implemented the relaxed_* variants by aliasing them
to the normal version since the dawn of time. There's a comment in
io.h saying something about us not having the expected semantics for
the relaxed variants, but I don't see what the issue is... Ben?

> I see that network drivers are working around the issue by calling
> __raw_write() API directly but this also breaks other architectures
> like SPARC since the semantics of __raw_writel() seems to be system dependent.

Yeah that's pretty gross. Which drivers are doing this?

> This is putting drivers into a tight position and they cannot achieve true
> multi-arch enablement and are forced into calling __raw APIs flavors
> directly with #ifdef BIG_ENDIAN ugliness.
>
> Sinan
>
> --
> Sinan Kaya
> Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
> Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21  3:40   ` Oliver
@ 2018-03-21 13:53     ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21 13:53 UTC (permalink / raw)
  To: Oliver; +Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/20/2018 10:40 PM, Oliver wrote:
>> I see that network drivers are working around the issue by calling
>> __raw_write() API directly but this also breaks other architectures
>> like SPARC since the semantics of __raw_writel() seems to be system dependent.
> Yeah that's pretty gross. Which drivers are doing this?
> 

Searching for __raw_writel() and BIG_ENDIAN in drivers/net directory
should give you the idea.

In a nutshell, drivers are doing this today.

wmb()
__raw_writel();
mmiowb()

I'm in the process of posting patches ([1], [2], [3]) to various subsystems to
eliminate double barriers by replacing sequences like 

wmb()
writel()
mmiowb()

with

wmb()
writel_relaxed()
mmiowb()

Reviewers pointed out that writel_relaxed() is not __raw_writel() on PPC
and cannot take advantage of the optimization. 

Replacing writel_relaxed() with raw_writel() or vice versa is not correct.

writel_relaxed() needs to have ordering guarantees with respect to the order
device observes writes. 

x86 has compiler barrier inside the relaxed() API so that code does not
get reordered. ARM64 architecturally guarantees device writes to be observed
in order.

I was hoping that PPC could follow x86 and inject compiler barrier into the
relaxed functions. 

BTW, I have no idea what compiler barrier does on PPC and if

wrltel() == compiler barrier() + wrltel_relaxed()

can be said.

Need guidance from the PPC maintainers. 

[1] https://www.spinics.net/lists/netdev/msg490480.html
[2] https://www.spinics.net/lists/arm-kernel/msg642341.html
[3] https://www.spinics.net/lists/arm-kernel/msg642336.html

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-21 13:53     ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21 13:53 UTC (permalink / raw)
  To: Oliver
  Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Benjamin Herrenschmidt

On 3/20/2018 10:40 PM, Oliver wrote:
>> I see that network drivers are working around the issue by calling
>> __raw_write() API directly but this also breaks other architectures
>> like SPARC since the semantics of __raw_writel() seems to be system dependent.
> Yeah that's pretty gross. Which drivers are doing this?
> 

Searching for __raw_writel() and BIG_ENDIAN in drivers/net directory
should give you the idea.

In a nutshell, drivers are doing this today.

wmb()
__raw_writel();
mmiowb()

I'm in the process of posting patches ([1], [2], [3]) to various subsystems to
eliminate double barriers by replacing sequences like 

wmb()
writel()
mmiowb()

with

wmb()
writel_relaxed()
mmiowb()

Reviewers pointed out that writel_relaxed() is not __raw_writel() on PPC
and cannot take advantage of the optimization. 

Replacing writel_relaxed() with raw_writel() or vice versa is not correct.

writel_relaxed() needs to have ordering guarantees with respect to the order
device observes writes. 

x86 has compiler barrier inside the relaxed() API so that code does not
get reordered. ARM64 architecturally guarantees device writes to be observed
in order.

I was hoping that PPC could follow x86 and inject compiler barrier into the
relaxed functions. 

BTW, I have no idea what compiler barrier does on PPC and if

wrltel() == compiler barrier() + wrltel_relaxed()

can be said.

Need guidance from the PPC maintainers. 

[1] https://www.spinics.net/lists/netdev/msg490480.html
[2] https://www.spinics.net/lists/arm-kernel/msg642341.html
[3] https://www.spinics.net/lists/arm-kernel/msg642336.html

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21 13:53     ` Sinan Kaya
@ 2018-03-21 13:58       ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21 13:58 UTC (permalink / raw)
  To: Oliver; +Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/21/2018 8:53 AM, Sinan Kaya wrote:
> BTW, I have no idea what compiler barrier does on PPC and if
> 
> wrltel() == compiler barrier() + wrltel_relaxed()
> 
> can be said.

this should have been

writel_relaxed() == compiler barrier() + __raw_writel()



-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-21 13:58       ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21 13:58 UTC (permalink / raw)
  To: Oliver
  Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Benjamin Herrenschmidt

On 3/21/2018 8:53 AM, Sinan Kaya wrote:
> BTW, I have no idea what compiler barrier does on PPC and if
> 
> wrltel() == compiler barrier() + wrltel_relaxed()
> 
> can be said.

this should have been

writel_relaxed() == compiler barrier() + __raw_writel()



-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-21 13:53     ` Sinan Kaya
@ 2018-03-21 14:35       ` David Laight
  -1 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-21 14:35 UTC (permalink / raw)
  To: 'Sinan Kaya', Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.

There are places where you don't even need a compile barrier between
every write.

I had horrid problems getting some ppc code (for a specific embedded SoC)
optimised to have no extra barriers.
I ended up just writing through 'pointer to volatile' and adding an
explicit 'eieio' between the block of writes and status read.

No less painful was doing a byteswapping write to normal memory.

	David


^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-21 14:35       ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-21 14:35 UTC (permalink / raw)
  To: 'Sinan Kaya', Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

PiB4ODYgaGFzIGNvbXBpbGVyIGJhcnJpZXIgaW5zaWRlIHRoZSByZWxheGVkKCkgQVBJIHNvIHRo
YXQgY29kZSBkb2VzIG5vdA0KPiBnZXQgcmVvcmRlcmVkLiBBUk02NCBhcmNoaXRlY3R1cmFsbHkg
Z3VhcmFudGVlcyBkZXZpY2Ugd3JpdGVzIHRvIGJlIG9ic2VydmVkDQo+IGluIG9yZGVyLg0KDQpU
aGVyZSBhcmUgcGxhY2VzIHdoZXJlIHlvdSBkb24ndCBldmVuIG5lZWQgYSBjb21waWxlIGJhcnJp
ZXIgYmV0d2Vlbg0KZXZlcnkgd3JpdGUuDQoNCkkgaGFkIGhvcnJpZCBwcm9ibGVtcyBnZXR0aW5n
IHNvbWUgcHBjIGNvZGUgKGZvciBhIHNwZWNpZmljIGVtYmVkZGVkIFNvQykNCm9wdGltaXNlZCB0
byBoYXZlIG5vIGV4dHJhIGJhcnJpZXJzLg0KSSBlbmRlZCB1cCBqdXN0IHdyaXRpbmcgdGhyb3Vn
aCAncG9pbnRlciB0byB2b2xhdGlsZScgYW5kIGFkZGluZyBhbg0KZXhwbGljaXQgJ2VpZWlvJyBi
ZXR3ZWVuIHRoZSBibG9jayBvZiB3cml0ZXMgYW5kIHN0YXR1cyByZWFkLg0KDQpObyBsZXNzIHBh
aW5mdWwgd2FzIGRvaW5nIGEgYnl0ZXN3YXBwaW5nIHdyaXRlIHRvIG5vcm1hbCBtZW1vcnkuDQoN
CglEYXZpZA0KDQo=

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21 14:35       ` David Laight
  (?)
@ 2018-03-21 15:04       ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-21 15:04 UTC (permalink / raw)
  To: David Laight, Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/21/2018 9:35 AM, David Laight wrote:
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
> 
> There are places where you don't even need a compile barrier between
> every write.
> 
> I had horrid problems getting some ppc code (for a specific embedded SoC)
> optimised to have no extra barriers.
> I ended up just writing through 'pointer to volatile' and adding an
> explicit 'eieio' between the block of writes and status read.
> 
> No less painful was doing a byteswapping write to normal memory.

If the architecture is reordering writes to the peripheral, then removing
the compiler barrier can break the multi-arch drivers. barriers document
clearly states that device need to observe writes in order. 

Though for special cases like you mentioned, you can certainly do this:

wmb()
__raw_write/pointer access
__raw_write/pointer access
__raw_write/pointer access

/* flush everything */
mmiowb()
__raw_write/pointer access

There would be no ordering guarantee between the wmb() and mmiowb().
This can only be done for known code and known hardware. I don't believe
this applies to multi-arch drivers.


> 
> 	David
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21 13:53     ` Sinan Kaya
@ 2018-03-22  4:24       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22  4:24 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> writel_relaxed() needs to have ordering guarantees with respect to the order
> device observes writes. 

Correct.

> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.
> 
> I was hoping that PPC could follow x86 and inject compiler barrier into the
> relaxed functions. 
> 
> BTW, I have no idea what compiler barrier does on PPC and if
> 
> wrltel() == compiler barrier() + wrltel_relaxed()
> 
> can be said.

No, it's not sufficient.

Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a benefit today.

The main problem is that the semantics of writel/writel_relaxed (and
read versions) aren't very well defined in Linux esp. when it comes
to different memory types (NC, WC, ...).

I've been wanting to implement the relaxed accessors for a while but
was battling with this to try to also better support WC, and due to
other commitments, this somewhat fell down the cracks.

Two options I can think of:

 - Just make the _relaxed variants use an eieio instead of a sync, this
will effectively lift the ordering guarantee vs. cachable storage (and
thus unlock) and might give a (small) performance improvement. However,
we still have the problem that on WC mappings, neither writel nor
writel_relaxed will effectively allow combining to happen (only raw
accesses will because on powerpc *all* barriers will break combining).

 - Make writel_relaxed() be a simple store without barriers, and
readl_relaxed() be "eieio, read, eieio", thus allowing write combining
to happen between successive writel_relaxed on WC space (no change on
normal NC space) while maintaining the ordering between relaxed reads
and writes. The flip side is a (slight) increased overhead of
readl_relaxed.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22  4:24       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22  4:24 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> writel_relaxed() needs to have ordering guarantees with respect to the order
> device observes writes. 

Correct.

> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.
> 
> I was hoping that PPC could follow x86 and inject compiler barrier into the
> relaxed functions. 
> 
> BTW, I have no idea what compiler barrier does on PPC and if
> 
> wrltel() == compiler barrier() + wrltel_relaxed()
> 
> can be said.

No, it's not sufficient.

Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a benefit today.

The main problem is that the semantics of writel/writel_relaxed (and
read versions) aren't very well defined in Linux esp. when it comes
to different memory types (NC, WC, ...).

I've been wanting to implement the relaxed accessors for a while but
was battling with this to try to also better support WC, and due to
other commitments, this somewhat fell down the cracks.

Two options I can think of:

 - Just make the _relaxed variants use an eieio instead of a sync, this
will effectively lift the ordering guarantee vs. cachable storage (and
thus unlock) and might give a (small) performance improvement. However,
we still have the problem that on WC mappings, neither writel nor
writel_relaxed will effectively allow combining to happen (only raw
accesses will because on powerpc *all* barriers will break combining).

 - Make writel_relaxed() be a simple store without barriers, and
readl_relaxed() be "eieio, read, eieio", thus allowing write combining
to happen between successive writel_relaxed on WC space (no change on
normal NC space) while maintaining the ordering between relaxed reads
and writes. The flip side is a (slight) increased overhead of
readl_relaxed.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21 14:35       ` David Laight
  (?)
  (?)
@ 2018-03-22  5:24       ` Oliver
  2018-03-22  8:20           ` Gabriel Paubert
  2018-03-22 10:37           ` David Laight
  -1 siblings, 2 replies; 216+ messages in thread
From: Oliver @ 2018-03-22  5:24 UTC (permalink / raw)
  To: David Laight
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>
> There are places where you don't even need a compile barrier between
> every write.
>
> I had horrid problems getting some ppc code (for a specific embedded SoC)
> optimised to have no extra barriers.
> I ended up just writing through 'pointer to volatile' and adding an
> explicit 'eieio' between the block of writes and status read.

This is what you are supposed to do. For accesses to MMIO (cache
inhibited + guarded) storage the Power ISA guarantees that load-load
and store-store pairs of accesses will always occur in program order,
but there's no implicit ordering between load-store or store-load
pairs. In those cases you need an explicit eieio barrier between the
two accesses. At the HW level you can think of the CPU as having
separate queues for MMIO loads and stores. Accesses will be added to
the respective queue in program order, but there's no synchronisation
between the two queues. If the CPU is doing write combining it's easy
to imagine the whole store queue being emptied in one big gulp before
the load queue is even touched.

> No less painful was doing a byteswapping write to normal memory.

What was the problem? The reverse indexed load/store instructions are
a little awkward to use, but they work...

>
>         David
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22  5:24       ` Oliver
@ 2018-03-22  8:20           ` Gabriel Paubert
  2018-03-22 10:37           ` David Laight
  1 sibling, 0 replies; 216+ messages in thread
From: Gabriel Paubert @ 2018-03-22  8:20 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, linux-rdma, David Laight,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> in order.
> >
> > There are places where you don't even need a compile barrier between
> > every write.
> >
> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> > optimised to have no extra barriers.
> > I ended up just writing through 'pointer to volatile' and adding an
> > explicit 'eieio' between the block of writes and status read.
> 
> This is what you are supposed to do. For accesses to MMIO (cache
> inhibited + guarded) storage the Power ISA guarantees that load-load
> and store-store pairs of accesses will always occur in program order,
> but there's no implicit ordering between load-store or store-load

And even for load store, eieio is not always necessary, in the important
case of reading and writing to the same address, when modifying bits in
a control register for example.

Typically also loads will be moved ahead of stores, but not the other
way around, so in practice you won't notice a missed eieio in this case.
This does not mean you should not insert it.

> pairs. In those cases you need an explicit eieio barrier between the
> two accesses. At the HW level you can think of the CPU as having
> separate queues for MMIO loads and stores. Accesses will be added to
> the respective queue in program order, but there's no synchronisation
> between the two queues. If the CPU is doing write combining it's easy
> to imagine the whole store queue being emptied in one big gulp before
> the load queue is even touched.

Is write combining allowed on guarded storage? 

<Looking at docs>
>From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":

"No combining occurs if the storage is also Guarded"

	Gabriel

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22  8:20           ` Gabriel Paubert
  0 siblings, 0 replies; 216+ messages in thread
From: Gabriel Paubert @ 2018-03-22  8:20 UTC (permalink / raw)
  To: Oliver
  Cc: David Laight, Sinan Kaya, linux-rdma,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> in order.
> >
> > There are places where you don't even need a compile barrier between
> > every write.
> >
> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> > optimised to have no extra barriers.
> > I ended up just writing through 'pointer to volatile' and adding an
> > explicit 'eieio' between the block of writes and status read.
> 
> This is what you are supposed to do. For accesses to MMIO (cache
> inhibited + guarded) storage the Power ISA guarantees that load-load
> and store-store pairs of accesses will always occur in program order,
> but there's no implicit ordering between load-store or store-load

And even for load store, eieio is not always necessary, in the important
case of reading and writing to the same address, when modifying bits in
a control register for example.

Typically also loads will be moved ahead of stores, but not the other
way around, so in practice you won't notice a missed eieio in this case.
This does not mean you should not insert it.

> pairs. In those cases you need an explicit eieio barrier between the
> two accesses. At the HW level you can think of the CPU as having
> separate queues for MMIO loads and stores. Accesses will be added to
> the respective queue in program order, but there's no synchronisation
> between the two queues. If the CPU is doing write combining it's easy
> to imagine the whole store queue being emptied in one big gulp before
> the load queue is even touched.

Is write combining allowed on guarded storage? 

<Looking at docs>
>From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":

"No combining occurs if the storage is also Guarded"

	Gabriel

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22  8:20           ` Gabriel Paubert
@ 2018-03-22  9:25             ` Oliver
  -1 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-22  9:25 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: Sinan Kaya, linux-rdma, David Laight,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
>> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
>> >> x86 has compiler barrier inside the relaxed() API so that code does not
>> >> get reordered. ARM64 architecturally guarantees device writes to be observed
>> >> in order.
>> >
>> > There are places where you don't even need a compile barrier between
>> > every write.
>> >
>> > I had horrid problems getting some ppc code (for a specific embedded SoC)
>> > optimised to have no extra barriers.
>> > I ended up just writing through 'pointer to volatile' and adding an
>> > explicit 'eieio' between the block of writes and status read.
>>
>> This is what you are supposed to do. For accesses to MMIO (cache
>> inhibited + guarded) storage the Power ISA guarantees that load-load
>> and store-store pairs of accesses will always occur in program order,
>> but there's no implicit ordering between load-store or store-load
>
> And even for load store, eieio is not always necessary, in the important
> case of reading and writing to the same address, when modifying bits in
> a control register for example.
>
> Typically also loads will be moved ahead of stores, but not the other
> way around, so in practice you won't notice a missed eieio in this case.
> This does not mean you should not insert it.

Yep, but it doesn't really help us here. The generic accessors need to cope
with the general case.

>> pairs. In those cases you need an explicit eieio barrier between the
>> two accesses. At the HW level you can think of the CPU as having
>> separate queues for MMIO loads and stores. Accesses will be added to
>> the respective queue in program order, but there's no synchronisation
>> between the two queues. If the CPU is doing write combining it's easy
>> to imagine the whole store queue being emptied in one big gulp before
>> the load queue is even touched.
>
> Is write combining allowed on guarded storage?
>
> <Looking at docs>
> From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
>
> "No combining occurs if the storage is also Guarded"

Yeah it's not allowed. That's what I get for handwaving examples ;)

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22  9:25             ` Oliver
  0 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-22  9:25 UTC (permalink / raw)
  To: Gabriel Paubert
  Cc: David Laight, Sinan Kaya, linux-rdma,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
>> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
>> >> x86 has compiler barrier inside the relaxed() API so that code does not
>> >> get reordered. ARM64 architecturally guarantees device writes to be observed
>> >> in order.
>> >
>> > There are places where you don't even need a compile barrier between
>> > every write.
>> >
>> > I had horrid problems getting some ppc code (for a specific embedded SoC)
>> > optimised to have no extra barriers.
>> > I ended up just writing through 'pointer to volatile' and adding an
>> > explicit 'eieio' between the block of writes and status read.
>>
>> This is what you are supposed to do. For accesses to MMIO (cache
>> inhibited + guarded) storage the Power ISA guarantees that load-load
>> and store-store pairs of accesses will always occur in program order,
>> but there's no implicit ordering between load-store or store-load
>
> And even for load store, eieio is not always necessary, in the important
> case of reading and writing to the same address, when modifying bits in
> a control register for example.
>
> Typically also loads will be moved ahead of stores, but not the other
> way around, so in practice you won't notice a missed eieio in this case.
> This does not mean you should not insert it.

Yep, but it doesn't really help us here. The generic accessors need to cope
with the general case.

>> pairs. In those cases you need an explicit eieio barrier between the
>> two accesses. At the HW level you can think of the CPU as having
>> separate queues for MMIO loads and stores. Accesses will be added to
>> the respective queue in program order, but there's no synchronisation
>> between the two queues. If the CPU is doing write combining it's easy
>> to imagine the whole store queue being emptied in one big gulp before
>> the load queue is even touched.
>
> Is write combining allowed on guarded storage?
>
> <Looking at docs>
> From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
>
> "No combining occurs if the storage is also Guarded"

Yeah it's not allowed. That's what I get for handwaving examples ;)

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22  4:24       ` Benjamin Herrenschmidt
@ 2018-03-22 10:15         ` Oliver
  -1 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-22 10:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
>> writel_relaxed() needs to have ordering guarantees with respect to the order
>> device observes writes.
>
> Correct.
>
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>>
>> I was hoping that PPC could follow x86 and inject compiler barrier into the
>> relaxed functions.
>>
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> wrltel() == compiler barrier() + wrltel_relaxed()
>>
>> can be said.
>
> No, it's not sufficient.
>
> Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> PPC, it will just not give you a benefit today.
>
> The main problem is that the semantics of writel/writel_relaxed (and
> read versions) aren't very well defined in Linux esp. when it comes
> to different memory types (NC, WC, ...).
>
> I've been wanting to implement the relaxed accessors for a while but
> was battling with this to try to also better support WC, and due to
> other commitments, this somewhat fell down the cracks.
>
> Two options I can think of:
>
>  - Just make the _relaxed variants use an eieio instead of a sync, this
> will effectively lift the ordering guarantee vs. cachable storage (and
> thus unlock) and might give a (small) performance improvement.

Wouldn't we still have the unlock ordering due to the io_sync hack or
are you thinking we should remove that too for the relaxed version?

> However,
> we still have the problem that on WC mappings, neither writel nor
> writel_relaxed will effectively allow combining to happen (only raw
> accesses will because on powerpc *all* barriers will break combining).

Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
affect combining
on non-guarded memory. Do most implementations apply it to all CI
accesses anyway?

>  - Make writel_relaxed() be a simple store without barriers, and
> readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> to happen between successive writel_relaxed on WC space (no change on
> normal NC space) while maintaining the ordering between relaxed reads
> and writes. The flip side is a (slight) increased overhead of
> readl_relaxed.

Are there many drivers that actually do writeX() on WC space?
memory-barriers.txt
pretty much says that all bets are off and no ordering guarantees can be assumed
when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
give me some pause, but maybe it works fine elsewhere.

Oliver

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22 10:15         ` Oliver
  0 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-22 10:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
>> writel_relaxed() needs to have ordering guarantees with respect to the order
>> device observes writes.
>
> Correct.
>
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>>
>> I was hoping that PPC could follow x86 and inject compiler barrier into the
>> relaxed functions.
>>
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> wrltel() == compiler barrier() + wrltel_relaxed()
>>
>> can be said.
>
> No, it's not sufficient.
>
> Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> PPC, it will just not give you a benefit today.
>
> The main problem is that the semantics of writel/writel_relaxed (and
> read versions) aren't very well defined in Linux esp. when it comes
> to different memory types (NC, WC, ...).
>
> I've been wanting to implement the relaxed accessors for a while but
> was battling with this to try to also better support WC, and due to
> other commitments, this somewhat fell down the cracks.
>
> Two options I can think of:
>
>  - Just make the _relaxed variants use an eieio instead of a sync, this
> will effectively lift the ordering guarantee vs. cachable storage (and
> thus unlock) and might give a (small) performance improvement.

Wouldn't we still have the unlock ordering due to the io_sync hack or
are you thinking we should remove that too for the relaxed version?

> However,
> we still have the problem that on WC mappings, neither writel nor
> writel_relaxed will effectively allow combining to happen (only raw
> accesses will because on powerpc *all* barriers will break combining).

Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
affect combining
on non-guarded memory. Do most implementations apply it to all CI
accesses anyway?

>  - Make writel_relaxed() be a simple store without barriers, and
> readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> to happen between successive writel_relaxed on WC space (no change on
> normal NC space) while maintaining the ordering between relaxed reads
> and writes. The flip side is a (slight) increased overhead of
> readl_relaxed.

Are there many drivers that actually do writeX() on WC space?
memory-barriers.txt
pretty much says that all bets are off and no ordering guarantees can be assumed
when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
give me some pause, but maybe it works fine elsewhere.

Oliver

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-22  5:24       ` Oliver
@ 2018-03-22 10:37           ` David Laight
  2018-03-22 10:37           ` David Laight
  1 sibling, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-22 10:37 UTC (permalink / raw)
  To: 'Oliver'
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

From: Oliver
> Sent: 22 March 2018 05:24
...
> > No less painful was doing a byteswapping write to normal memory.
> 
> What was the problem? The reverse indexed load/store instructions are
> a little awkward to use, but they work...

Finding something that would generate the right instruction without any
barriers.
ISTR writing my own asm pattern.

	David


^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-22 10:37           ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-22 10:37 UTC (permalink / raw)
  To: 'Oliver'
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

RnJvbTogT2xpdmVyDQo+IFNlbnQ6IDIyIE1hcmNoIDIwMTggMDU6MjQNCi4uLg0KPiA+IE5vIGxl
c3MgcGFpbmZ1bCB3YXMgZG9pbmcgYSBieXRlc3dhcHBpbmcgd3JpdGUgdG8gbm9ybWFsIG1lbW9y
eS4NCj4gDQo+IFdoYXQgd2FzIHRoZSBwcm9ibGVtPyBUaGUgcmV2ZXJzZSBpbmRleGVkIGxvYWQv
c3RvcmUgaW5zdHJ1Y3Rpb25zIGFyZQ0KPiBhIGxpdHRsZSBhd2t3YXJkIHRvIHVzZSwgYnV0IHRo
ZXkgd29yay4uLg0KDQpGaW5kaW5nIHNvbWV0aGluZyB0aGF0IHdvdWxkIGdlbmVyYXRlIHRoZSBy
aWdodCBpbnN0cnVjdGlvbiB3aXRob3V0IGFueQ0KYmFycmllcnMuDQpJU1RSIHdyaXRpbmcgbXkg
b3duIGFzbSBwYXR0ZXJuLg0KDQoJRGF2aWQNCg0K

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22  9:25             ` Oliver
@ 2018-03-22 11:25               ` Gabriel Paubert
  -1 siblings, 0 replies; 216+ messages in thread
From: Gabriel Paubert @ 2018-03-22 11:25 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, linux-rdma, David Laight,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 08:25:43PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> > On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> >> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> >> in order.
> >> >
> >> > There are places where you don't even need a compile barrier between
> >> > every write.
> >> >
> >> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> >> > optimised to have no extra barriers.
> >> > I ended up just writing through 'pointer to volatile' and adding an
> >> > explicit 'eieio' between the block of writes and status read.
> >>
> >> This is what you are supposed to do. For accesses to MMIO (cache
> >> inhibited + guarded) storage the Power ISA guarantees that load-load
> >> and store-store pairs of accesses will always occur in program order,
> >> but there's no implicit ordering between load-store or store-load
> >
> > And even for load store, eieio is not always necessary, in the important
> > case of reading and writing to the same address, when modifying bits in
> > a control register for example.
> >
> > Typically also loads will be moved ahead of stores, but not the other
> > way around, so in practice you won't notice a missed eieio in this case.
> > This does not mean you should not insert it.
> 
> Yep, but it doesn't really help us here. The generic accessors need to cope
> with the general case.

A generic accessor for modifying fields in a device register might be an 
useful addition to the current set. This is a fairly frequent operation.

Actually I did add macros to do exactly this in drivers for our own 
hardware here almost 20 years ago. I was fed up with writing
writel(readl(reg) & mask | value, reg), especially when reg was not
that simple (one device had over 100 registers). The macros obviously 
guaranteed that both accesses would be to the same register, something 
easy to get wrong with cut and paste.

> 
> >> pairs. In those cases you need an explicit eieio barrier between the
> >> two accesses. At the HW level you can think of the CPU as having
> >> separate queues for MMIO loads and stores. Accesses will be added to
> >> the respective queue in program order, but there's no synchronisation
> >> between the two queues. If the CPU is doing write combining it's easy
> >> to imagine the whole store queue being emptied in one big gulp before
> >> the load queue is even touched.
> >
> > Is write combining allowed on guarded storage?
> >
> > <Looking at docs>
> > From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
> >
> > "No combining occurs if the storage is also Guarded"
> 
> Yeah it's not allowed. That's what I get for handwaving examples ;)

At least it means that, for cache-inhibited guarded storage, there is a 
one to one correspondance between instructions and bus cycles. The only 
issue left is ordering ;)

	Gabriel

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22 11:25               ` Gabriel Paubert
  0 siblings, 0 replies; 216+ messages in thread
From: Gabriel Paubert @ 2018-03-22 11:25 UTC (permalink / raw)
  To: Oliver
  Cc: David Laight, Sinan Kaya, linux-rdma,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 08:25:43PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> > On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> >> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> >> in order.
> >> >
> >> > There are places where you don't even need a compile barrier between
> >> > every write.
> >> >
> >> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> >> > optimised to have no extra barriers.
> >> > I ended up just writing through 'pointer to volatile' and adding an
> >> > explicit 'eieio' between the block of writes and status read.
> >>
> >> This is what you are supposed to do. For accesses to MMIO (cache
> >> inhibited + guarded) storage the Power ISA guarantees that load-load
> >> and store-store pairs of accesses will always occur in program order,
> >> but there's no implicit ordering between load-store or store-load
> >
> > And even for load store, eieio is not always necessary, in the important
> > case of reading and writing to the same address, when modifying bits in
> > a control register for example.
> >
> > Typically also loads will be moved ahead of stores, but not the other
> > way around, so in practice you won't notice a missed eieio in this case.
> > This does not mean you should not insert it.
> 
> Yep, but it doesn't really help us here. The generic accessors need to cope
> with the general case.

A generic accessor for modifying fields in a device register might be an 
useful addition to the current set. This is a fairly frequent operation.

Actually I did add macros to do exactly this in drivers for our own 
hardware here almost 20 years ago. I was fed up with writing
writel(readl(reg) & mask | value, reg), especially when reg was not
that simple (one device had over 100 registers). The macros obviously 
guaranteed that both accesses would be to the same register, something 
easy to get wrong with cut and paste.

> 
> >> pairs. In those cases you need an explicit eieio barrier between the
> >> two accesses. At the HW level you can think of the CPU as having
> >> separate queues for MMIO loads and stores. Accesses will be added to
> >> the respective queue in program order, but there's no synchronisation
> >> between the two queues. If the CPU is doing write combining it's easy
> >> to imagine the whole store queue being emptied in one big gulp before
> >> the load queue is even touched.
> >
> > Is write combining allowed on guarded storage?
> >
> > <Looking at docs>
> > From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
> >
> > "No combining occurs if the storage is also Guarded"
> 
> Yeah it's not allowed. That's what I get for handwaving examples ;)

At least it means that, for cache-inhibited guarded storage, there is a 
one to one correspondance between instructions and bus cycles. The only 
issue left is ordering ;)

	Gabriel

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22 10:15         ` Oliver
@ 2018-03-22 13:52           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22 13:52 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, 2018-03-22 at 21:15 +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> > > writel_relaxed() needs to have ordering guarantees with respect to the order
> > > device observes writes.
> > 
> > Correct.
> > 
> > > x86 has compiler barrier inside the relaxed() API so that code does not
> > > get reordered. ARM64 architecturally guarantees device writes to be observed
> > > in order.
> > > 
> > > I was hoping that PPC could follow x86 and inject compiler barrier into the
> > > relaxed functions.
> > > 
> > > BTW, I have no idea what compiler barrier does on PPC and if
> > > 
> > > wrltel() == compiler barrier() + wrltel_relaxed()
> > > 
> > > can be said.
> > 
> > No, it's not sufficient.

Just to clarify ... barrier() is just a compiler barrier, it means the
compiler will generate things in the order they are written. This isn't
sufficient on archs with an OO memory model, where an actual memory
barrier instruction needs to be emited.

As for Oliver comments...

> > Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> > PPC, it will just not give you a benefit today.
> > 
> > The main problem is that the semantics of writel/writel_relaxed (and
> > read versions) aren't very well defined in Linux esp. when it comes
> > to different memory types (NC, WC, ...).
> > 
> > I've been wanting to implement the relaxed accessors for a while but
> > was battling with this to try to also better support WC, and due to
> > other commitments, this somewhat fell down the cracks.
> > 
> > Two options I can think of:
> > 
> >  - Just make the _relaxed variants use an eieio instead of a sync, this
> > will effectively lift the ordering guarantee vs. cachable storage (and
> > thus unlock) and might give a (small) performance improvement.
> 
> Wouldn't we still have the unlock ordering due to the io_sync hack or
> are you thinking we should remove that too for the relaxed version?

Well, the documentation says we don't care about synchronization vs.
locks so we should probably remove it (then we need to make sure mmiowb
works, thus sets the flag).

> > However,
> > we still have the problem that on WC mappings, neither writel nor
> > writel_relaxed will effectively allow combining to happen (only raw
> > accesses will because on powerpc *all* barriers will break combining).
> 
> Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
> affect combining
> on non-guarded memory. Do most implementations apply it to all CI
> accesses anyway?

Yes, as far as I know all implementations will stop combining on *any*
barrier instruction.

> >  - Make writel_relaxed() be a simple store without barriers, and
> > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > to happen between successive writel_relaxed on WC space (no change on
> > normal NC space) while maintaining the ordering between relaxed reads
> > and writes. The flip side is a (slight) increased overhead of
> > readl_relaxed.
> 
> Are there many drivers that actually do writeX() on WC space?
> memory-barriers.txt
> pretty much says that all bets are off and no ordering guarantees can be assumed
> when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> give me some pause, but maybe it works fine elsewhere.

I don't know whether any does it, but I want to provide a way for a
driver to somewhat reliably obtain write combine semantics without
having to hand code endian swap and other horrors involved with using
__raw_* accessors.

So my thinking is to define that the combination of WC + writel_relaxed
gives you that, which it does at least on x86 and ARM afaik.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22 13:52           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22 13:52 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

On Thu, 2018-03-22 at 21:15 +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> > > writel_relaxed() needs to have ordering guarantees with respect to the order
> > > device observes writes.
> > 
> > Correct.
> > 
> > > x86 has compiler barrier inside the relaxed() API so that code does not
> > > get reordered. ARM64 architecturally guarantees device writes to be observed
> > > in order.
> > > 
> > > I was hoping that PPC could follow x86 and inject compiler barrier into the
> > > relaxed functions.
> > > 
> > > BTW, I have no idea what compiler barrier does on PPC and if
> > > 
> > > wrltel() == compiler barrier() + wrltel_relaxed()
> > > 
> > > can be said.
> > 
> > No, it's not sufficient.

Just to clarify ... barrier() is just a compiler barrier, it means the
compiler will generate things in the order they are written. This isn't
sufficient on archs with an OO memory model, where an actual memory
barrier instruction needs to be emited.

As for Oliver comments...

> > Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> > PPC, it will just not give you a benefit today.
> > 
> > The main problem is that the semantics of writel/writel_relaxed (and
> > read versions) aren't very well defined in Linux esp. when it comes
> > to different memory types (NC, WC, ...).
> > 
> > I've been wanting to implement the relaxed accessors for a while but
> > was battling with this to try to also better support WC, and due to
> > other commitments, this somewhat fell down the cracks.
> > 
> > Two options I can think of:
> > 
> >  - Just make the _relaxed variants use an eieio instead of a sync, this
> > will effectively lift the ordering guarantee vs. cachable storage (and
> > thus unlock) and might give a (small) performance improvement.
> 
> Wouldn't we still have the unlock ordering due to the io_sync hack or
> are you thinking we should remove that too for the relaxed version?

Well, the documentation says we don't care about synchronization vs.
locks so we should probably remove it (then we need to make sure mmiowb
works, thus sets the flag).

> > However,
> > we still have the problem that on WC mappings, neither writel nor
> > writel_relaxed will effectively allow combining to happen (only raw
> > accesses will because on powerpc *all* barriers will break combining).
> 
> Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
> affect combining
> on non-guarded memory. Do most implementations apply it to all CI
> accesses anyway?

Yes, as far as I know all implementations will stop combining on *any*
barrier instruction.

> >  - Make writel_relaxed() be a simple store without barriers, and
> > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > to happen between successive writel_relaxed on WC space (no change on
> > normal NC space) while maintaining the ordering between relaxed reads
> > and writes. The flip side is a (slight) increased overhead of
> > readl_relaxed.
> 
> Are there many drivers that actually do writeX() on WC space?
> memory-barriers.txt
> pretty much says that all bets are off and no ordering guarantees can be assumed
> when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> give me some pause, but maybe it works fine elsewhere.

I don't know whether any does it, but I want to provide a way for a
driver to somewhat reliably obtain write combine semantics without
having to hand code endian swap and other horrors involved with using
__raw_* accessors.

So my thinking is to define that the combination of WC + writel_relaxed
gives you that, which it does at least on x86 and ARM afaik.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22 13:52           ` Benjamin Herrenschmidt
@ 2018-03-22 17:51             ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-22 17:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> No, it's not sufficient.
> Just to clarify ... barrier() is just a compiler barrier, it means the
> compiler will generate things in the order they are written. This isn't
> sufficient on archs with an OO memory model, where an actual memory
> barrier instruction needs to be emited.

Surprisingly, ARM64 GCC compiler generates a write barrier as
opposed to preventing code reordering.

I was curious if this is an ARM only thing or not. 

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-22 17:51             ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-22 17:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> No, it's not sufficient.
> Just to clarify ... barrier() is just a compiler barrier, it means the
> compiler will generate things in the order they are written. This isn't
> sufficient on archs with an OO memory model, where an actual memory
> barrier instruction needs to be emited.

Surprisingly, ARM64 GCC compiler generates a write barrier as
opposed to preventing code reordering.

I was curious if this is an ARM only thing or not. 

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22 17:51             ` Sinan Kaya
@ 2018-03-23  0:16               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-23  0:16 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon

On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > No, it's not sufficient.
> > 
> > Just to clarify ... barrier() is just a compiler barrier, it means the
> > compiler will generate things in the order they are written. This isn't
> > sufficient on archs with an OO memory model, where an actual memory
> > barrier instruction needs to be emited.
> 
> Surprisingly, ARM64 GCC compiler generates a write barrier as
> opposed to preventing code reordering.
> 
> I was curious if this is an ARM only thing or not. 

Are you sure of that ? I thought it's the ARM implementation of writel
that had an explicit write barrier in it:

#define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })

And __iowmb() is 

#define __iowmb()		wmb()

Note, I'm a bit dubious about this in ARM:

#define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }

Will, Marc, on powerpc, we put a sync *before* the read in readl etc...

The reasoning was there could be some DMA setup followed by a side
effect readl rather than a side effect writel to trigger a DMA. Granted
I wouldn't expect modern devices to be that stupid, but I have vague
memory of some devices back in the day having that sort of read ops.

In general, I though the model offerred by x86 and thus by Linux
readl/writel was full synchronization both before and after the MMIO,
vs either other MMIO or all other forms of ops (cachable memory, locks
etc...).

Also, can't the above readl_relaxed leak out of a lock ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-23  0:16               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-23  0:16 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linuxppc dev list, linux-rdma, Marc Zyngier, Will Deacon

On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > No, it's not sufficient.
> > 
> > Just to clarify ... barrier() is just a compiler barrier, it means the
> > compiler will generate things in the order they are written. This isn't
> > sufficient on archs with an OO memory model, where an actual memory
> > barrier instruction needs to be emited.
> 
> Surprisingly, ARM64 GCC compiler generates a write barrier as
> opposed to preventing code reordering.
> 
> I was curious if this is an ARM only thing or not. 

Are you sure of that ? I thought it's the ARM implementation of writel
that had an explicit write barrier in it:

#define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })

And __iowmb() is 

#define __iowmb()		wmb()

Note, I'm a bit dubious about this in ARM:

#define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }

Will, Marc, on powerpc, we put a sync *before* the read in readl etc...

The reasoning was there could be some DMA setup followed by a side
effect readl rather than a side effect writel to trigger a DMA. Granted
I wouldn't expect modern devices to be that stupid, but I have vague
memory of some devices back in the day having that sort of read ops.

In general, I though the model offerred by x86 and thus by Linux
readl/writel was full synchronization both before and after the MMIO,
vs either other MMIO or all other forms of ops (cachable memory, locks
etc...).

Also, can't the above readl_relaxed leak out of a lock ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-23  0:16               ` Benjamin Herrenschmidt
@ 2018-03-23 13:42                 ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-23 13:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon

On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>>>> No, it's not sufficient.
>>>
>>> Just to clarify ... barrier() is just a compiler barrier, it means the
>>> compiler will generate things in the order they are written. This isn't
>>> sufficient on archs with an OO memory model, where an actual memory
>>> barrier instruction needs to be emited.
>>
>> Surprisingly, ARM64 GCC compiler generates a write barrier as
>> opposed to preventing code reordering.
>>
>> I was curious if this is an ARM only thing or not. 
> 
> Are you sure of that ? I thought it's the ARM implementation of writel
> that had an explicit write barrier in it:

Yes, I'm %100 sure. The answer is both writel() and barrier() generates
a write barrier instruction. I found this by searching the kernel disassembly
for back to back "dsb st" instruction.

> 
> #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> 
> And __iowmb() is 
> 
> #define __iowmb()		wmb()
> 
> Note, I'm a bit dubious about this in ARM:
> 
> #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> 
> Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> 
> The reasoning was there could be some DMA setup followed by a side
> effect readl rather than a side effect writel to trigger a DMA. Granted
> I wouldn't expect modern devices to be that stupid, but I have vague
> memory of some devices back in the day having that sort of read ops.
> 
> In general, I though the model offerred by x86 and thus by Linux
> readl/writel was full synchronization both before and after the MMIO,
> vs either other MMIO or all other forms of ops (cachable memory, locks
> etc...).
> 
> Also, can't the above readl_relaxed leak out of a lock ?

I think you are asking about PPC, correct? 

I read somewhere that PPC implementation keeps track of MMIO accesses and
has an implicit barrier inside the spin_unlock() code for such accesses.
Isn't this true?

> 
> Cheers,
> Ben.
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-23 13:42                 ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-23 13:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linuxppc dev list, linux-rdma, Marc Zyngier, Will Deacon

On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>>>> No, it's not sufficient.
>>>
>>> Just to clarify ... barrier() is just a compiler barrier, it means the
>>> compiler will generate things in the order they are written. This isn't
>>> sufficient on archs with an OO memory model, where an actual memory
>>> barrier instruction needs to be emited.
>>
>> Surprisingly, ARM64 GCC compiler generates a write barrier as
>> opposed to preventing code reordering.
>>
>> I was curious if this is an ARM only thing or not. 
> 
> Are you sure of that ? I thought it's the ARM implementation of writel
> that had an explicit write barrier in it:

Yes, I'm %100 sure. The answer is both writel() and barrier() generates
a write barrier instruction. I found this by searching the kernel disassembly
for back to back "dsb st" instruction.

> 
> #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> 
> And __iowmb() is 
> 
> #define __iowmb()		wmb()
> 
> Note, I'm a bit dubious about this in ARM:
> 
> #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> 
> Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> 
> The reasoning was there could be some DMA setup followed by a side
> effect readl rather than a side effect writel to trigger a DMA. Granted
> I wouldn't expect modern devices to be that stupid, but I have vague
> memory of some devices back in the day having that sort of read ops.
> 
> In general, I though the model offerred by x86 and thus by Linux
> readl/writel was full synchronization both before and after the MMIO,
> vs either other MMIO or all other forms of ops (cachable memory, locks
> etc...).
> 
> Also, can't the above readl_relaxed leak out of a lock ?

I think you are asking about PPC, correct? 

I read somewhere that PPC implementation keeps track of MMIO accesses and
has an implicit barrier inside the spin_unlock() code for such accesses.
Isn't this true?

> 
> Cheers,
> Ben.
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-22 13:52           ` Benjamin Herrenschmidt
@ 2018-03-23 16:35             ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-23 16:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Oliver, linux-rdma

On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:

> > >  - Make writel_relaxed() be a simple store without barriers, and
> > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > to happen between successive writel_relaxed on WC space (no change on
> > > normal NC space) while maintaining the ordering between relaxed reads
> > > and writes. The flip side is a (slight) increased overhead of
> > > readl_relaxed.
> > 
> > Are there many drivers that actually do writeX() on WC space?
> > memory-barriers.txt
> > pretty much says that all bets are off and no ordering guarantees can be assumed
> > when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> > give me some pause, but maybe it works fine elsewhere.
> 
> I don't know whether any does it, but I want to provide a way for a
> driver to somewhat reliably obtain write combine semantics without
> having to hand code endian swap and other horrors involved with using
> __raw_* accessors.

Many of the drivers in drivers/infiniband work with write combining
memory.

The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
efficiently as possible, ideally in a single PCI-E TLP.

A memcpy_to_wc primitive could probably cover these use cases, no need
to redesign the IO accessors..

The WC memory is never read, so read/write order is not important to
any infiniband driver.

What is very important is keeping the WC behavior isolated within the
spinlock. WC to the same addresses cannot be permitted in this pattern:

   writel(addr = 0);
   mmiowmb();
   spin_unlock();
   spin_lock()
   writel(addr = 0);

The CPU must always generate two PCI-E TLPs to the device.

This is a super performance critical operation for most drivers and
directly impacts network performance.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-23 16:35             ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-23 16:35 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Oliver, Sinan Kaya,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:

> > >  - Make writel_relaxed() be a simple store without barriers, and
> > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > to happen between successive writel_relaxed on WC space (no change on
> > > normal NC space) while maintaining the ordering between relaxed reads
> > > and writes. The flip side is a (slight) increased overhead of
> > > readl_relaxed.
> > 
> > Are there many drivers that actually do writeX() on WC space?
> > memory-barriers.txt
> > pretty much says that all bets are off and no ordering guarantees can be assumed
> > when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> > give me some pause, but maybe it works fine elsewhere.
> 
> I don't know whether any does it, but I want to provide a way for a
> driver to somewhat reliably obtain write combine semantics without
> having to hand code endian swap and other horrors involved with using
> __raw_* accessors.

Many of the drivers in drivers/infiniband work with write combining
memory.

The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
efficiently as possible, ideally in a single PCI-E TLP.

A memcpy_to_wc primitive could probably cover these use cases, no need
to redesign the IO accessors..

The WC memory is never read, so read/write order is not important to
any infiniband driver.

What is very important is keeping the WC behavior isolated within the
spinlock. WC to the same addresses cannot be permitted in this pattern:

   writel(addr = 0);
   mmiowmb();
   spin_unlock();
   spin_lock()
   writel(addr = 0);

The CPU must always generate two PCI-E TLPs to the device.

This is a super performance critical operation for most drivers and
directly impacts network performance.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-23 13:42                 ` Sinan Kaya
@ 2018-03-24  1:22                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-24  1:22 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon

On Fri, 2018-03-23 at 09:42 -0400, Sinan Kaya wrote:
> On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> > > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > > > No, it's not sufficient.
> > > > 
> > > > Just to clarify ... barrier() is just a compiler barrier, it means the
> > > > compiler will generate things in the order they are written. This isn't
> > > > sufficient on archs with an OO memory model, where an actual memory
> > > > barrier instruction needs to be emited.
> > > 
> > > Surprisingly, ARM64 GCC compiler generates a write barrier as
> > > opposed to preventing code reordering.
> > > 
> > > I was curious if this is an ARM only thing or not. 
> > 
> > Are you sure of that ? I thought it's the ARM implementation of writel
> > that had an explicit write barrier in it:
> 
> Yes, I'm %100 sure. The answer is both writel() and barrier() generates
> a write barrier instruction. I found this by searching the kernel disassembly
> for back to back "dsb st" instruction.

I'm not sure you are correct here. As I wrote below, the implementatoin
of writel() contains an *explicit" memory barrier which is completely
different to a barrier() instruction:
> 
> > #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> > 
> > And __iowmb() is 
> > 
> > #define __iowmb()		wmb()
> > 
> > Note, I'm a bit dubious about this in ARM:
> > 
> > #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> > 
> > Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> > 
> > The reasoning was there could be some DMA setup followed by a side
> > effect readl rather than a side effect writel to trigger a DMA. Granted
> > I wouldn't expect modern devices to be that stupid, but I have vague
> > memory of some devices back in the day having that sort of read ops.
> > 
> > In general, I though the model offerred by x86 and thus by Linux
> > readl/writel was full synchronization both before and after the MMIO,
> > vs either other MMIO or all other forms of ops (cachable memory, locks
> > etc...).
> > 
> > Also, can't the above readl_relaxed leak out of a lock ?
> 
> I think you are asking about PPC, correct? 

No, I'm asking about ARM.

> I read somewhere that PPC implementation keeps track of MMIO accesses and
> has an implicit barrier inside the spin_unlock() code for such accesses.
> Isn't this true?

Yes, I wrote that code :-) We keep track of writes and put an implicit
stronger barrier in unlock on ppc64 because drivers never get mmiowb
right.

Cheers,
Ben.

> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-24  1:22                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-24  1:22 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linuxppc dev list, linux-rdma, Marc Zyngier, Will Deacon

On Fri, 2018-03-23 at 09:42 -0400, Sinan Kaya wrote:
> On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> > > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > > > No, it's not sufficient.
> > > > 
> > > > Just to clarify ... barrier() is just a compiler barrier, it means the
> > > > compiler will generate things in the order they are written. This isn't
> > > > sufficient on archs with an OO memory model, where an actual memory
> > > > barrier instruction needs to be emited.
> > > 
> > > Surprisingly, ARM64 GCC compiler generates a write barrier as
> > > opposed to preventing code reordering.
> > > 
> > > I was curious if this is an ARM only thing or not. 
> > 
> > Are you sure of that ? I thought it's the ARM implementation of writel
> > that had an explicit write barrier in it:
> 
> Yes, I'm %100 sure. The answer is both writel() and barrier() generates
> a write barrier instruction. I found this by searching the kernel disassembly
> for back to back "dsb st" instruction.

I'm not sure you are correct here. As I wrote below, the implementatoin
of writel() contains an *explicit" memory barrier which is completely
different to a barrier() instruction:
> 
> > #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> > 
> > And __iowmb() is 
> > 
> > #define __iowmb()		wmb()
> > 
> > Note, I'm a bit dubious about this in ARM:
> > 
> > #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> > 
> > Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> > 
> > The reasoning was there could be some DMA setup followed by a side
> > effect readl rather than a side effect writel to trigger a DMA. Granted
> > I wouldn't expect modern devices to be that stupid, but I have vague
> > memory of some devices back in the day having that sort of read ops.
> > 
> > In general, I though the model offerred by x86 and thus by Linux
> > readl/writel was full synchronization both before and after the MMIO,
> > vs either other MMIO or all other forms of ops (cachable memory, locks
> > etc...).
> > 
> > Also, can't the above readl_relaxed leak out of a lock ?
> 
> I think you are asking about PPC, correct? 

No, I'm asking about ARM.

> I read somewhere that PPC implementation keeps track of MMIO accesses and
> has an implicit barrier inside the spin_unlock() code for such accesses.
> Isn't this true?

Yes, I wrote that code :-) We keep track of writes and put an implicit
stronger barrier in unlock on ppc64 because drivers never get mmiowb
right.

Cheers,
Ben.

> > 
> > Cheers,
> > Ben.
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-23 16:35             ` Jason Gunthorpe
@ 2018-03-24  1:23               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-24  1:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Oliver, linux-rdma

On Fri, 2018-03-23 at 10:35 -0600, Jason Gunthorpe wrote:
> On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:
> 
> > > >  - Make writel_relaxed() be a simple store without barriers, and
> > > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > > to happen between successive writel_relaxed on WC space (no change on
> > > > normal NC space) while maintaining the ordering between relaxed reads
> > > > and writes. The flip side is a (slight) increased overhead of
> > > > readl_relaxed.
> > > 
> > > Are there many drivers that actually do writeX() on WC space?
> > > memory-barriers.txt
> > > pretty much says that all bets are off and no ordering guarantees can be assumed
> > > when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> > > give me some pause, but maybe it works fine elsewhere.
> > 
> > I don't know whether any does it, but I want to provide a way for a
> > driver to somewhat reliably obtain write combine semantics without
> > having to hand code endian swap and other horrors involved with using
> > __raw_* accessors.
> 
> Many of the drivers in drivers/infiniband work with write combining
> memory.
> 
> The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
> efficiently as possible, ideally in a single PCI-E TLP.
> 
> A memcpy_to_wc primitive could probably cover these use cases, no need
> to redesign the IO accessors..
> 
> The WC memory is never read, so read/write order is not important to
> any infiniband driver.
> 
> What is very important is keeping the WC behavior isolated within the
> spinlock. WC to the same addresses cannot be permitted in this pattern:
> 
>    writel(addr = 0);
>    mmiowmb();
>    spin_unlock();
>    spin_lock()
>    writel(addr = 0);
> 
> The CPU must always generate two PCI-E TLPs to the device.

On powerpc you'll never get write combining with writel. So that at
least is covered.

> This is a super performance critical operation for most drivers and
> directly impacts network performance.
> 
> Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-24  1:23               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-24  1:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Oliver, Sinan Kaya,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Fri, 2018-03-23 at 10:35 -0600, Jason Gunthorpe wrote:
> On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:
> 
> > > >  - Make writel_relaxed() be a simple store without barriers, and
> > > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > > to happen between successive writel_relaxed on WC space (no change on
> > > > normal NC space) while maintaining the ordering between relaxed reads
> > > > and writes. The flip side is a (slight) increased overhead of
> > > > readl_relaxed.
> > > 
> > > Are there many drivers that actually do writeX() on WC space?
> > > memory-barriers.txt
> > > pretty much says that all bets are off and no ordering guarantees can be assumed
> > > when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> > > give me some pause, but maybe it works fine elsewhere.
> > 
> > I don't know whether any does it, but I want to provide a way for a
> > driver to somewhat reliably obtain write combine semantics without
> > having to hand code endian swap and other horrors involved with using
> > __raw_* accessors.
> 
> Many of the drivers in drivers/infiniband work with write combining
> memory.
> 
> The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
> efficiently as possible, ideally in a single PCI-E TLP.
> 
> A memcpy_to_wc primitive could probably cover these use cases, no need
> to redesign the IO accessors..
> 
> The WC memory is never read, so read/write order is not important to
> any infiniband driver.
> 
> What is very important is keeping the WC behavior isolated within the
> spinlock. WC to the same addresses cannot be permitted in this pattern:
> 
>    writel(addr = 0);
>    mmiowmb();
>    spin_unlock();
>    spin_lock()
>    writel(addr = 0);
> 
> The CPU must always generate two PCI-E TLPs to the device.

On powerpc you'll never get write combining with writel. So that at
least is covered.

> This is a super performance critical operation for most drivers and
> directly impacts network performance.
> 
> Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-24  1:22                   ` Benjamin Herrenschmidt
@ 2018-03-24 15:06                     ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-24 15:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon

On 3/23/2018 9:22 PM, Benjamin Herrenschmidt wrote:
>> Yes, I'm %100 sure. The answer is both writel() and barrier() generates
>> a write barrier instruction. I found this by searching the kernel disassembly
>> for back to back "dsb st" instruction.
> I'm not sure you are correct here. As I wrote below, the implementatoin
> of writel() contains an *explicit" memory barrier which is completely
> different to a barrier() instruction:

OK. I did some directed tests and I'm taking it back. 
barrier() is a compiler reordering statement only.

What got me confused was this sequence:

wmb()
barrier()
writel()

I thought that the second barrier instruction was coming from barrier() but
it was actually coming from writel().

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-24 15:06                     ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-24 15:06 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linuxppc dev list, linux-rdma, Marc Zyngier, Will Deacon

On 3/23/2018 9:22 PM, Benjamin Herrenschmidt wrote:
>> Yes, I'm %100 sure. The answer is both writel() and barrier() generates
>> a write barrier instruction. I found this by searching the kernel disassembly
>> for back to back "dsb st" instruction.
> I'm not sure you are correct here. As I wrote below, the implementatoin
> of writel() contains an *explicit" memory barrier which is completely
> different to a barrier() instruction:

OK. I did some directed tests and I'm taking it back. 
barrier() is a compiler reordering statement only.

What got me confused was this sequence:

wmb()
barrier()
writel()

I thought that the second barrier instruction was coming from barrier() but
it was actually coming from writel().

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-24  1:23               ` Benjamin Herrenschmidt
@ 2018-03-26 11:08                 ` David Laight
  -1 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-26 11:08 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Jason Gunthorpe
  Cc: Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

> > This is a super performance critical operation for most drivers and
> > directly impacts network performance.

Perhaps there ought to be writel_nobarrier() (etc) that never contain
any barriers at all.
This might mean that they are always just the memory operation,
but it would make it more obvious what the driver was doing.

The driver would then be explicitly responsible for all the rmb(), wmb()
and mmiowb() (etc).
Performance critical paths could then avoid all the extra barriers.

	David



^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-26 11:08                 ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-26 11:08 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Jason Gunthorpe
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Oliver, linux-rdma

PiA+IFRoaXMgaXMgYSBzdXBlciBwZXJmb3JtYW5jZSBjcml0aWNhbCBvcGVyYXRpb24gZm9yIG1v
c3QgZHJpdmVycyBhbmQNCj4gPiBkaXJlY3RseSBpbXBhY3RzIG5ldHdvcmsgcGVyZm9ybWFuY2Uu
DQoNClBlcmhhcHMgdGhlcmUgb3VnaHQgdG8gYmUgd3JpdGVsX25vYmFycmllcigpIChldGMpIHRo
YXQgbmV2ZXIgY29udGFpbg0KYW55IGJhcnJpZXJzIGF0IGFsbC4NClRoaXMgbWlnaHQgbWVhbiB0
aGF0IHRoZXkgYXJlIGFsd2F5cyBqdXN0IHRoZSBtZW1vcnkgb3BlcmF0aW9uLA0KYnV0IGl0IHdv
dWxkIG1ha2UgaXQgbW9yZSBvYnZpb3VzIHdoYXQgdGhlIGRyaXZlciB3YXMgZG9pbmcuDQoNClRo
ZSBkcml2ZXIgd291bGQgdGhlbiBiZSBleHBsaWNpdGx5IHJlc3BvbnNpYmxlIGZvciBhbGwgdGhl
IHJtYigpLCB3bWIoKQ0KYW5kIG1taW93YigpIChldGMpLg0KUGVyZm9ybWFuY2UgY3JpdGljYWwg
cGF0aHMgY291bGQgdGhlbiBhdm9pZCBhbGwgdGhlIGV4dHJhIGJhcnJpZXJzLg0KDQoJRGF2aWQN
Cg0KDQo=

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-23  0:16               ` Benjamin Herrenschmidt
@ 2018-03-26 11:44                 ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-26 11:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, linuxppc dev list, Oliver, Marc Zyngier, linux-rdma

Hi Ben,

I don't seem to have the beginning of this thread, so please bounce it over
if you'd like me to look at it!

On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > > No, it's not sufficient.
> > > 
> > > Just to clarify ... barrier() is just a compiler barrier, it means the
> > > compiler will generate things in the order they are written. This isn't
> > > sufficient on archs with an OO memory model, where an actual memory
> > > barrier instruction needs to be emited.
> > 
> > Surprisingly, ARM64 GCC compiler generates a write barrier as
> > opposed to preventing code reordering.

In context, this looks like a misunderstanding somewhere. barrier() is a
compiler barrier for us just like everybody else and we use the generic
implementation with the empty asm + memory clobber.

> > I was curious if this is an ARM only thing or not. 
> 
> Are you sure of that ? I thought it's the ARM implementation of writel
> that had an explicit write barrier in it:
> 
> #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> 
> And __iowmb() is 
> 
> #define __iowmb()		wmb()
> 
> Note, I'm a bit dubious about this in ARM:
> 
> #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> 
> Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> 
> The reasoning was there could be some DMA setup followed by a side
> effect readl rather than a side effect writel to trigger a DMA. Granted
> I wouldn't expect modern devices to be that stupid, but I have vague
> memory of some devices back in the day having that sort of read ops.

The reason we have it afterwards was for something like:

	while (!(readl(&status_register) & DMA_DONE))
		data = *dma_buffer

to ensure that we don't read stale data from the buffer. You might also
need this for systems with spurious/early IRQ delivery for DMA completion.

You'd have to throw in an explicit mb() if you wanted to order prior writel
before the side-effectcs of a a later readl.

> In general, I though the model offerred by x86 and thus by Linux
> readl/writel was full synchronization both before and after the MMIO,
> vs either other MMIO or all other forms of ops (cachable memory, locks
> etc...).
> 
> Also, can't the above readl_relaxed leak out of a lock ?

No, it's ordered with respect to the release store to the lockword but
that doesn't mean that an unlock does anything like ensure that the read
has been satisifed (in particular, for your scenario above where it has
side-effects then unlocking the lock doesn't guarantee that they've
occurred).

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 11:44                 ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-26 11:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Oliver, linuxppc dev list, linux-rdma, Marc Zyngier

Hi Ben,

I don't seem to have the beginning of this thread, so please bounce it over
if you'd like me to look at it!

On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > > No, it's not sufficient.
> > > 
> > > Just to clarify ... barrier() is just a compiler barrier, it means the
> > > compiler will generate things in the order they are written. This isn't
> > > sufficient on archs with an OO memory model, where an actual memory
> > > barrier instruction needs to be emited.
> > 
> > Surprisingly, ARM64 GCC compiler generates a write barrier as
> > opposed to preventing code reordering.

In context, this looks like a misunderstanding somewhere. barrier() is a
compiler barrier for us just like everybody else and we use the generic
implementation with the empty asm + memory clobber.

> > I was curious if this is an ARM only thing or not. 
> 
> Are you sure of that ? I thought it's the ARM implementation of writel
> that had an explicit write barrier in it:
> 
> #define writel(v,c)		({ __iowmb(); writel_relaxed((v),(c)); })
> 
> And __iowmb() is 
> 
> #define __iowmb()		wmb()
> 
> Note, I'm a bit dubious about this in ARM:
> 
> #define readl(c)		({ u32 __v = readl_relaxed(c); __iormb(); __v; }
> 
> Will, Marc, on powerpc, we put a sync *before* the read in readl etc...
> 
> The reasoning was there could be some DMA setup followed by a side
> effect readl rather than a side effect writel to trigger a DMA. Granted
> I wouldn't expect modern devices to be that stupid, but I have vague
> memory of some devices back in the day having that sort of read ops.

The reason we have it afterwards was for something like:

	while (!(readl(&status_register) & DMA_DONE))
		data = *dma_buffer

to ensure that we don't read stale data from the buffer. You might also
need this for systems with spurious/early IRQ delivery for DMA completion.

You'd have to throw in an explicit mb() if you wanted to order prior writel
before the side-effectcs of a a later readl.

> In general, I though the model offerred by x86 and thus by Linux
> readl/writel was full synchronization both before and after the MMIO,
> vs either other MMIO or all other forms of ops (cachable memory, locks
> etc...).
> 
> Also, can't the above readl_relaxed leak out of a lock ?

No, it's ordered with respect to the release store to the lockword but
that doesn't mean that an unlock does anything like ensure that the read
has been satisifed (in particular, for your scenario above where it has
side-effects then unlocking the lock doesn't guarantee that they've
occurred).

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 11:44                 ` Will Deacon
@ 2018-03-26 12:11                   ` okaya
  -1 siblings, 0 replies; 216+ messages in thread
From: okaya @ 2018-03-26 12:11 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-rdma, Oliver, linuxppc dev list, Marc Zyngier

On 2018-03-26 07:44, Will Deacon wrote:
> Hi Ben,
> 
> I don't seem to have the beginning of this thread, so please bounce it 
> over
> if you'd like me to look at it!
> 

https://www.spinics.net/lists/linux-rdma/msg62570.html

https://www.spinics.net/lists/linux-rdma/index.html#62666


> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>> > > > > No, it's not sufficient.
>> > >
>> > > Just to clarify ... barrier() is just a compiler barrier, it means the
>> > > compiler will generate things in the order they are written. This isn't
>> > > sufficient on archs with an OO memory model, where an actual memory
>> > > barrier instruction needs to be emited.
>> >
>> > Surprisingly, ARM64 GCC compiler generates a write barrier as
>> > opposed to preventing code reordering.
> 
> In context, this looks like a misunderstanding somewhere. barrier() is 
> a
> compiler barrier for us just like everybody else and we use the generic
> implementation with the empty asm + memory clobber.
> 

True, I clarified it this weekend

https://www.spinics.net/lists/linux-rdma/msg62788.html

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 12:11                   ` okaya
  0 siblings, 0 replies; 216+ messages in thread
From: okaya @ 2018-03-26 12:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Oliver, linuxppc dev list, linux-rdma,
	Marc Zyngier

On 2018-03-26 07:44, Will Deacon wrote:
> Hi Ben,
> 
> I don't seem to have the beginning of this thread, so please bounce it 
> over
> if you'd like me to look at it!
> 

https://www.spinics.net/lists/linux-rdma/msg62570.html

https://www.spinics.net/lists/linux-rdma/index.html#62666


> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>> > > > > No, it's not sufficient.
>> > >
>> > > Just to clarify ... barrier() is just a compiler barrier, it means the
>> > > compiler will generate things in the order they are written. This isn't
>> > > sufficient on archs with an OO memory model, where an actual memory
>> > > barrier instruction needs to be emited.
>> >
>> > Surprisingly, ARM64 GCC compiler generates a write barrier as
>> > opposed to preventing code reordering.
> 
> In context, this looks like a misunderstanding somewhere. barrier() is 
> a
> compiler barrier for us just like everybody else and we use the generic
> implementation with the empty asm + memory clobber.
> 

True, I clarified it this weekend

https://www.spinics.net/lists/linux-rdma/msg62788.html

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 12:11                   ` okaya
@ 2018-03-26 12:42                     ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 12:42 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-rdma, Oliver, linuxppc dev list, Marc Zyngier

On 3/26/2018 8:11 AM, okaya@codeaurora.org wrote:
> On 2018-03-26 07:44, Will Deacon wrote:
>> Hi Ben,
>>
>> I don't seem to have the beginning of this thread, so please bounce it over
>> if you'd like me to look at it!
>>
> 
> https://www.spinics.net/lists/linux-rdma/msg62570.html
> 
> https://www.spinics.net/lists/linux-rdma/index.html#62666
> 

To add some more details on why we are looking at this now:

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

The conversation on this thread is interesting.

https://patchwork.kernel.org/patch/10288987/

1. I tried to replace wmb()+writel() with wmb()+writel_relaxed().
2. writel_relaxed() is equal to writel() at this moment for PPC.
3. Chelsio developers wanted to pull it into wmb()+__raw_writel() direction
to take advantage of the same optimization for PPC.
4. Dave informed us that behavior of __raw_write() is not identical on all
architectures.
5. We decided to go back to PPC and ask to implement writel_relaxed()
instead of coming up with writel_realy_relaxed() API.

> 
>> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
>>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> > > > > No, it's not sufficient.
>>> > >
>>> > > Just to clarify ... barrier() is just a compiler barrier, it means the
>>> > > compiler will generate things in the order they are written. This isn't
>>> > > sufficient on archs with an OO memory model, where an actual memory
>>> > > barrier instruction needs to be emited.
>>> >
>>> > Surprisingly, ARM64 GCC compiler generates a write barrier as
>>> > opposed to preventing code reordering.
>>
>> In context, this looks like a misunderstanding somewhere. barrier() is a
>> compiler barrier for us just like everybody else and we use the generic
>> implementation with the empty asm + memory clobber.
>>
> 
> True, I clarified it this weekend
> 
> https://www.spinics.net/lists/linux-rdma/msg62788.html
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 12:42                     ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 12:42 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Oliver, linuxppc dev list, linux-rdma,
	Marc Zyngier

On 3/26/2018 8:11 AM, okaya@codeaurora.org wrote:
> On 2018-03-26 07:44, Will Deacon wrote:
>> Hi Ben,
>>
>> I don't seem to have the beginning of this thread, so please bounce it over
>> if you'd like me to look at it!
>>
> 
> https://www.spinics.net/lists/linux-rdma/msg62570.html
> 
> https://www.spinics.net/lists/linux-rdma/index.html#62666
> 

To add some more details on why we are looking at this now:

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

The conversation on this thread is interesting.

https://patchwork.kernel.org/patch/10288987/

1. I tried to replace wmb()+writel() with wmb()+writel_relaxed().
2. writel_relaxed() is equal to writel() at this moment for PPC.
3. Chelsio developers wanted to pull it into wmb()+__raw_writel() direction
to take advantage of the same optimization for PPC.
4. Dave informed us that behavior of __raw_write() is not identical on all
architectures.
5. We decided to go back to PPC and ask to implement writel_relaxed()
instead of coming up with writel_realy_relaxed() API.

> 
>> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote:
>>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
>>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> > > > > No, it's not sufficient.
>>> > >
>>> > > Just to clarify ... barrier() is just a compiler barrier, it means the
>>> > > compiler will generate things in the order they are written. This isn't
>>> > > sufficient on archs with an OO memory model, where an actual memory
>>> > > barrier instruction needs to be emited.
>>> >
>>> > Surprisingly, ARM64 GCC compiler generates a write barrier as
>>> > opposed to preventing code reordering.
>>
>> In context, this looks like a misunderstanding somewhere. barrier() is a
>> compiler barrier for us just like everybody else and we use the generic
>> implementation with the empty asm + memory clobber.
>>
> 
> True, I clarified it this weekend
> 
> https://www.spinics.net/lists/linux-rdma/msg62788.html
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-21 13:58       ` Sinan Kaya
@ 2018-03-26 13:43         ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 13:43 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver

On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> wrltel() == compiler barrier() + wrltel_relaxed()
>>
>> can be said.
>
> this should have been
>
> writel_relaxed() == compiler barrier() + __raw_writel()

I don't think anyone clarified this so far, but there are additional differences
between the two, writel_relaxed() assumes we are talking to a 32-bit
little-endian
MMIO register, while __raw_writel() is primarily used for writing into
memory-type
regions with no particular byte order. This means:

- writel_relaxed() must perform a byte swap when running on big-endian kernels
- when used with __packed MMIO pointers, __raw_writel() may turn into a series
  of byte writes, while writel_relaxed() must result in a single 32-bit access.
- A set if consecutive writel_relaxed() on the same device is issued in program
  order, while __raw_writel() is not ordered. This typically requires
only a compiler
  barrier, but may also need a CPU barrier (in addition to the
barriers we use to
  serialize with spinlocks and DMA in writel() but not writel_relaxed()).

        Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 13:43         ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 13:43 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Oliver, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> wrltel() == compiler barrier() + wrltel_relaxed()
>>
>> can be said.
>
> this should have been
>
> writel_relaxed() == compiler barrier() + __raw_writel()

I don't think anyone clarified this so far, but there are additional differences
between the two, writel_relaxed() assumes we are talking to a 32-bit
little-endian
MMIO register, while __raw_writel() is primarily used for writing into
memory-type
regions with no particular byte order. This means:

- writel_relaxed() must perform a byte swap when running on big-endian kernels
- when used with __packed MMIO pointers, __raw_writel() may turn into a series
  of byte writes, while writel_relaxed() must result in a single 32-bit access.
- A set if consecutive writel_relaxed() on the same device is issued in program
  order, while __raw_writel() is not ordered. This typically requires
only a compiler
  barrier, but may also need a CPU barrier (in addition to the
barriers we use to
  serialize with spinlocks and DMA in writel() but not writel_relaxed()).

        Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 13:43         ` Arnd Bergmann
@ 2018-03-26 16:00           ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 16:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver

On 3/26/2018 9:43 AM, Arnd Bergmann wrote:
> On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>>> BTW, I have no idea what compiler barrier does on PPC and if
>>>
>>> wrltel() == compiler barrier() + wrltel_relaxed()
>>>
>>> can be said.
>>
>> this should have been
>>
>> writel_relaxed() == compiler barrier() + __raw_writel()
> 
> I don't think anyone clarified this so far, but there are additional differences
> between the two, writel_relaxed() assumes we are talking to a 32-bit
> little-endian
> MMIO register, while __raw_writel() is primarily used for writing into
> memory-type
> regions with no particular byte order. This means:
> 
> - writel_relaxed() must perform a byte swap when running on big-endian kernels
> - when used with __packed MMIO pointers, __raw_writel() may turn into a series
>   of byte writes, while writel_relaxed() must result in a single 32-bit access.
> - A set if consecutive writel_relaxed() on the same device is issued in program
>   order, while __raw_writel() is not ordered. This typically requires
> only a compiler
>   barrier, but may also need a CPU barrier (in addition to the
> barriers we use to
>   serialize with spinlocks and DMA in writel() but not writel_relaxed()).

Thanks for the great summary. I didn't know that __raw_writel() could get converted
to byte writes.

> 
>         Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 16:00           ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 16:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Oliver, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/26/2018 9:43 AM, Arnd Bergmann wrote:
> On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>>> BTW, I have no idea what compiler barrier does on PPC and if
>>>
>>> wrltel() == compiler barrier() + wrltel_relaxed()
>>>
>>> can be said.
>>
>> this should have been
>>
>> writel_relaxed() == compiler barrier() + __raw_writel()
> 
> I don't think anyone clarified this so far, but there are additional differences
> between the two, writel_relaxed() assumes we are talking to a 32-bit
> little-endian
> MMIO register, while __raw_writel() is primarily used for writing into
> memory-type
> regions with no particular byte order. This means:
> 
> - writel_relaxed() must perform a byte swap when running on big-endian kernels
> - when used with __packed MMIO pointers, __raw_writel() may turn into a series
>   of byte writes, while writel_relaxed() must result in a single 32-bit access.
> - A set if consecutive writel_relaxed() on the same device is issued in program
>   order, while __raw_writel() is not ordered. This typically requires
> only a compiler
>   barrier, but may also need a CPU barrier (in addition to the
> barriers we use to
>   serialize with spinlocks and DMA in writel() but not writel_relaxed()).

Thanks for the great summary. I didn't know that __raw_writel() could get converted
to byte writes.

> 
>         Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 11:08                 ` David Laight
@ 2018-03-26 16:54                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 16:54 UTC (permalink / raw)
  To: David Laight
  Cc: Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> > > This is a super performance critical operation for most drivers and
> > > directly impacts network performance.
> 
> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> any barriers at all.
> This might mean that they are always just the memory operation,
> but it would make it more obvious what the driver was doing.

I think that is what writel_relaxed is supposed to be.

The only restriction it has is that the writes to a single device
using UC memory must be kept in program order..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 16:54                   ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 16:54 UTC (permalink / raw)
  To: David Laight
  Cc: 'Benjamin Herrenschmidt',
	Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Oliver, linux-rdma

On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> > > This is a super performance critical operation for most drivers and
> > > directly impacts network performance.
> 
> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> any barriers at all.
> This might mean that they are always just the memory operation,
> but it would make it more obvious what the driver was doing.

I think that is what writel_relaxed is supposed to be.

The only restriction it has is that the writes to a single device
using UC memory must be kept in program order..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 16:54                   ` Jason Gunthorpe
@ 2018-03-26 19:44                     ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 19:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> > > This is a super performance critical operation for most drivers and
>> > > directly impacts network performance.
>>
>> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> any barriers at all.
>> This might mean that they are always just the memory operation,
>> but it would make it more obvious what the driver was doing.
>
> I think that is what writel_relaxed is supposed to be.
>
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Not sure about whether we have ever defined what happens to
writel_relaxed() on WC memory though: On ARM, we disallow
the compiler to combine writes, but the CPU still might.

It's also not entirely clear to me what we want writel() inside a
spinlock to mean: should the spinlock guarantee that two writel()
calls on different CPUs that are protected by spinlocks are
serialized by those locks, or not?

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 19:44                     ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 19:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> > > This is a super performance critical operation for most drivers and
>> > > directly impacts network performance.
>>
>> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> any barriers at all.
>> This might mean that they are always just the memory operation,
>> but it would make it more obvious what the driver was doing.
>
> I think that is what writel_relaxed is supposed to be.
>
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Not sure about whether we have ever defined what happens to
writel_relaxed() on WC memory though: On ARM, we disallow
the compiler to combine writes, but the CPU still might.

It's also not entirely clear to me what we want writel() inside a
spinlock to mean: should the spinlock guarantee that two writel()
calls on different CPUs that are protected by spinlocks are
serialized by those locks, or not?

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 19:44                     ` Arnd Bergmann
@ 2018-03-26 20:25                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 20:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> > > This is a super performance critical operation for most drivers and
> >> > > directly impacts network performance.
> >>
> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> any barriers at all.
> >> This might mean that they are always just the memory operation,
> >> but it would make it more obvious what the driver was doing.
> >
> > I think that is what writel_relaxed is supposed to be.
> >
> > The only restriction it has is that the writes to a single device
> > using UC memory must be kept in program order..
> 
> Not sure about whether we have ever defined what happens to
> writel_relaxed() on WC memory though: On ARM, we disallow
> the compiler to combine writes, but the CPU still might.

If the driver uses WC memory then I think it should not expect
anything in terms of how writes map to TLPs other than nothing
combines across mmiowb() and mmiowb() is fully globally ordered when
enclosed in a spinlock.

The entire point of using WC memory is usually to get combining :) If
the driver doesn't want that then it should map UC..

> It's also not entirely clear to me what we want writel() inside a
> spinlock to mean: should the spinlock guarantee that two writel()
> calls on different CPUs that are protected by spinlocks are
> serialized by those locks, or not?

Yes for writel, I think that is already defined by the barriers
document

The same document says that _relaxed() does not give that guarentee.

The lwn articule on this went into some depth on the interaction with
spinlocks.

As far as I can see, containment in a spinlock seems to be the only
different between writel and writel_relaxed..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 20:25                       ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 20:25 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> > > This is a super performance critical operation for most drivers and
> >> > > directly impacts network performance.
> >>
> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> any barriers at all.
> >> This might mean that they are always just the memory operation,
> >> but it would make it more obvious what the driver was doing.
> >
> > I think that is what writel_relaxed is supposed to be.
> >
> > The only restriction it has is that the writes to a single device
> > using UC memory must be kept in program order..
> 
> Not sure about whether we have ever defined what happens to
> writel_relaxed() on WC memory though: On ARM, we disallow
> the compiler to combine writes, but the CPU still might.

If the driver uses WC memory then I think it should not expect
anything in terms of how writes map to TLPs other than nothing
combines across mmiowb() and mmiowb() is fully globally ordered when
enclosed in a spinlock.

The entire point of using WC memory is usually to get combining :) If
the driver doesn't want that then it should map UC..

> It's also not entirely clear to me what we want writel() inside a
> spinlock to mean: should the spinlock guarantee that two writel()
> calls on different CPUs that are protected by spinlocks are
> serialized by those locks, or not?

Yes for writel, I think that is already defined by the barriers
document

The same document says that _relaxed() does not give that guarentee.

The lwn articule on this went into some depth on the interaction with
spinlocks.

As far as I can see, containment in a spinlock seems to be the only
different between writel and writel_relaxed..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 20:25                       ` Jason Gunthorpe
@ 2018-03-26 20:43                         ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 20:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> > > This is a super performance critical operation for most drivers and
>> >> > > directly impacts network performance.
>> >>
>> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> any barriers at all.
>> >> This might mean that they are always just the memory operation,
>> >> but it would make it more obvious what the driver was doing.
>> >
>> > I think that is what writel_relaxed is supposed to be.
>> >
>> > The only restriction it has is that the writes to a single device
>> > using UC memory must be kept in program order..
>>
>> Not sure about whether we have ever defined what happens to
>> writel_relaxed() on WC memory though: On ARM, we disallow
>> the compiler to combine writes, but the CPU still might.
>
> If the driver uses WC memory then I think it should not expect
> anything in terms of how writes map to TLPs other than nothing
> combines across mmiowb() and mmiowb() is fully globally ordered when
> enclosed in a spinlock.
>
> The entire point of using WC memory is usually to get combining :) If
> the driver doesn't want that then it should map UC..

Usually, WC memory is used with memcpy_toio() though, which
by definition doesn't have any barriers between accesses, and
is required to get the correct byte ordering on writes to memory buffers.

>> It's also not entirely clear to me what we want writel() inside a
>> spinlock to mean: should the spinlock guarantee that two writel()
>> calls on different CPUs that are protected by spinlocks are
>> serialized by those locks, or not?
>
> Yes for writel, I think that is already defined by the barriers
> document

Sorry, I meant writel_relaxed(), not writel()

> The same document says that _relaxed() does not give that guarentee.
>
> The lwn articule on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> different between writel and writel_relaxed..

I was always puzzled by this: The intention of _relaxed() on ARM
(where it originates) was to skip the barrier that serializes DMA
with MMIO, not to skip the serialization between MMIO and locks.

I never fully understood the part about the locks, but from what
I remember, ARM is still serialized without the barrier here, but
dropping the barrier on powerpc writel_relaxed() would not
serialize against locks or DMA.

         Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 20:43                         ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 20:43 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> > > This is a super performance critical operation for most drivers and
>> >> > > directly impacts network performance.
>> >>
>> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> any barriers at all.
>> >> This might mean that they are always just the memory operation,
>> >> but it would make it more obvious what the driver was doing.
>> >
>> > I think that is what writel_relaxed is supposed to be.
>> >
>> > The only restriction it has is that the writes to a single device
>> > using UC memory must be kept in program order..
>>
>> Not sure about whether we have ever defined what happens to
>> writel_relaxed() on WC memory though: On ARM, we disallow
>> the compiler to combine writes, but the CPU still might.
>
> If the driver uses WC memory then I think it should not expect
> anything in terms of how writes map to TLPs other than nothing
> combines across mmiowb() and mmiowb() is fully globally ordered when
> enclosed in a spinlock.
>
> The entire point of using WC memory is usually to get combining :) If
> the driver doesn't want that then it should map UC..

Usually, WC memory is used with memcpy_toio() though, which
by definition doesn't have any barriers between accesses, and
is required to get the correct byte ordering on writes to memory buffers.

>> It's also not entirely clear to me what we want writel() inside a
>> spinlock to mean: should the spinlock guarantee that two writel()
>> calls on different CPUs that are protected by spinlocks are
>> serialized by those locks, or not?
>
> Yes for writel, I think that is already defined by the barriers
> document

Sorry, I meant writel_relaxed(), not writel()

> The same document says that _relaxed() does not give that guarentee.
>
> The lwn articule on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> different between writel and writel_relaxed..

I was always puzzled by this: The intention of _relaxed() on ARM
(where it originates) was to skip the barrier that serializes DMA
with MMIO, not to skip the serialization between MMIO and locks.

I never fully understood the part about the locks, but from what
I remember, ARM is still serialized without the barrier here, but
dropping the barrier on powerpc writel_relaxed() would not
serialize against locks or DMA.

         Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 20:43                         ` Arnd Bergmann
@ 2018-03-26 21:09                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 21:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
> 
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance
though. It is too pessimistic. What the drivers usually want is a
unwound block of 4 or 8 8-byte copies. No function calls, no
branching. Everything is already known to be aligned.

Most of the drivers have a unwound loop with writeq() or something to
do it.

> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(),
Documentation/memory-barriers.txt gives an explicit example demanding
the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it
was to give program ordering to the accesses and that _relaxed allowed
re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only
difference between the two is spinlock containment, and WillD wrote
this text, so I belive it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM
as well..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 21:09                           ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 21:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
> 
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance
though. It is too pessimistic. What the drivers usually want is a
unwound block of 4 or 8 8-byte copies. No function calls, no
branching. Everything is already known to be aligned.

Most of the drivers have a unwound loop with writeq() or something to
do it.

> > The same document says that _relaxed() does not give that guarentee.
> >
> > The lwn articule on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > different between writel and writel_relaxed..
> 
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(),
Documentation/memory-barriers.txt gives an explicit example demanding
the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it
was to give program ordering to the accesses and that _relaxed allowed
re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only
difference between the two is spinlock containment, and WillD wrote
this text, so I belive it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM
as well..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 16:54                   ` Jason Gunthorpe
@ 2018-03-26 21:26                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 21:26 UTC (permalink / raw)
  To: Jason Gunthorpe, David Laight
  Cc: Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma

On Mon, 2018-03-26 at 10:54 -0600, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> > > > This is a super performance critical operation for most drivers and
> > > > directly impacts network performance.
> > 
> > Perhaps there ought to be writel_nobarrier() (etc) that never contain
> > any barriers at all.
> > This might mean that they are always just the memory operation,
> > but it would make it more obvious what the driver was doing.
> 
> I think that is what writel_relaxed is supposed to be.
> 
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Which requires barriers on some architectures :-)

Also we don't have a clear definition of what happens on WC memory.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 21:26                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 21:26 UTC (permalink / raw)
  To: Jason Gunthorpe, David Laight
  Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Oliver, linux-rdma

On Mon, 2018-03-26 at 10:54 -0600, Jason Gunthorpe wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> > > > This is a super performance critical operation for most drivers and
> > > > directly impacts network performance.
> > 
> > Perhaps there ought to be writel_nobarrier() (etc) that never contain
> > any barriers at all.
> > This might mean that they are always just the memory operation,
> > but it would make it more obvious what the driver was doing.
> 
> I think that is what writel_relaxed is supposed to be.
> 
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Which requires barriers on some architectures :-)

Also we don't have a clear definition of what happens on WC memory.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 21:09                           ` Jason Gunthorpe
@ 2018-03-26 21:30                             ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 21:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> >> > > This is a super performance critical operation for most drivers and
>> >> >> > > directly impacts network performance.
>> >> >>
>> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> >> any barriers at all.
>> >> >> This might mean that they are always just the memory operation,
>> >> >> but it would make it more obvious what the driver was doing.
>> >> >
>> >> > I think that is what writel_relaxed is supposed to be.
>> >> >
>> >> > The only restriction it has is that the writes to a single device
>> >> > using UC memory must be kept in program order..
>> >>
>> >> Not sure about whether we have ever defined what happens to
>> >> writel_relaxed() on WC memory though: On ARM, we disallow
>> >> the compiler to combine writes, but the CPU still might.
>> >
>> > If the driver uses WC memory then I think it should not expect
>> > anything in terms of how writes map to TLPs other than nothing
>> > combines across mmiowb() and mmiowb() is fully globally ordered when
>> > enclosed in a spinlock.
>> >
>> > The entire point of using WC memory is usually to get combining :) If
>> > the driver doesn't want that then it should map UC..
>>
>> Usually, WC memory is used with memcpy_toio() though, which
>> by definition doesn't have any barriers between accesses, and
>> is required to get the correct byte ordering on writes to memory buffers.
>
> memcpy_toio is too expensive to actually use for anything performance
> though. It is too pessimistic. What the drivers usually want is a
> unwound block of 4 or 8 8-byte copies. No function calls, no
> branching. Everything is already known to be aligned.
>
> Most of the drivers have a unwound loop with writeq() or something to
> do it.

But isn't the writeq() barrier much more expensive than anything you'd
do in function calls?

>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>>
>> I was always puzzled by this: The intention of _relaxed() on ARM
>> (where it originates) was to skip the barrier that serializes DMA
>> with MMIO, not to skip the serialization between MMIO and locks.
>
> But that was never a requirement of writel(),
> Documentation/memory-barriers.txt gives an explicit example demanding
> the wmb() before writel() for ordering system memory against writel.

Indeed, but it's in an example for when to use dma_wmb(), not wmb().
Adding Alexander Duyck to Cc, he added that section as part of
1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
dma_wmb()"). Also adding the other people that were involved with that.

> I actually have no idea why ARM had that barrier, I always assumed it
> was to give program ordering to the accesses and that _relaxed allowed
> re-ordering (the usual meaning of relaxed)..
>
> But the barrier document makes it pretty clear that the only
> difference between the two is spinlock containment, and WillD wrote
> this text, so I belive it is accurate for ARM.
>
> Very confusing.

It does mention serialization with both DMA and locks in the
section about  readX_relaxed()/writeX_relaxed(). The part
about DMA is very clear here, and I must have just forgotten
the exact semantics with regards to spinlocks. I'm still not
sure what prevents a writel() from leaking out the end of a
spinlock section that doesn't happen with writel_relaxed(), since
the barrier in writel() comes before the access, and the
spin_unlock() shouldn't affect the external buses.

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 21:30                             ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-26 21:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Benjamin Herrenschmidt,
	Paul E. McKenney

On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> >> > > This is a super performance critical operation for most drivers and
>> >> >> > > directly impacts network performance.
>> >> >>
>> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> >> any barriers at all.
>> >> >> This might mean that they are always just the memory operation,
>> >> >> but it would make it more obvious what the driver was doing.
>> >> >
>> >> > I think that is what writel_relaxed is supposed to be.
>> >> >
>> >> > The only restriction it has is that the writes to a single device
>> >> > using UC memory must be kept in program order..
>> >>
>> >> Not sure about whether we have ever defined what happens to
>> >> writel_relaxed() on WC memory though: On ARM, we disallow
>> >> the compiler to combine writes, but the CPU still might.
>> >
>> > If the driver uses WC memory then I think it should not expect
>> > anything in terms of how writes map to TLPs other than nothing
>> > combines across mmiowb() and mmiowb() is fully globally ordered when
>> > enclosed in a spinlock.
>> >
>> > The entire point of using WC memory is usually to get combining :) If
>> > the driver doesn't want that then it should map UC..
>>
>> Usually, WC memory is used with memcpy_toio() though, which
>> by definition doesn't have any barriers between accesses, and
>> is required to get the correct byte ordering on writes to memory buffers.
>
> memcpy_toio is too expensive to actually use for anything performance
> though. It is too pessimistic. What the drivers usually want is a
> unwound block of 4 or 8 8-byte copies. No function calls, no
> branching. Everything is already known to be aligned.
>
> Most of the drivers have a unwound loop with writeq() or something to
> do it.

But isn't the writeq() barrier much more expensive than anything you'd
do in function calls?

>> > The same document says that _relaxed() does not give that guarentee.
>> >
>> > The lwn articule on this went into some depth on the interaction with
>> > spinlocks.
>> >
>> > As far as I can see, containment in a spinlock seems to be the only
>> > different between writel and writel_relaxed..
>>
>> I was always puzzled by this: The intention of _relaxed() on ARM
>> (where it originates) was to skip the barrier that serializes DMA
>> with MMIO, not to skip the serialization between MMIO and locks.
>
> But that was never a requirement of writel(),
> Documentation/memory-barriers.txt gives an explicit example demanding
> the wmb() before writel() for ordering system memory against writel.

Indeed, but it's in an example for when to use dma_wmb(), not wmb().
Adding Alexander Duyck to Cc, he added that section as part of
1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
dma_wmb()"). Also adding the other people that were involved with that.

> I actually have no idea why ARM had that barrier, I always assumed it
> was to give program ordering to the accesses and that _relaxed allowed
> re-ordering (the usual meaning of relaxed)..
>
> But the barrier document makes it pretty clear that the only
> difference between the two is spinlock containment, and WillD wrote
> this text, so I belive it is accurate for ARM.
>
> Very confusing.

It does mention serialization with both DMA and locks in the
section about  readX_relaxed()/writeX_relaxed(). The part
about DMA is very clear here, and I must have just forgotten
the exact semantics with regards to spinlocks. I'm still not
sure what prevents a writel() from leaking out the end of a
spinlock section that doesn't happen with writel_relaxed(), since
the barrier in writel() comes before the access, and the
spin_unlock() shouldn't affect the external buses.

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 21:30                             ` Arnd Bergmann
@ 2018-03-26 21:46                               ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 21:46 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, David Laight, Oliver,
	Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>> But that was never a requirement of writel(),
>> Documentation/memory-barriers.txt gives an explicit example demanding
>> the wmb() before writel() for ordering system memory against writel.
> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.
> 

ARM developers can get away with not including wmb() in their code and use
writel() to observe memory writes due to implicit barriers.

However, same code will not work on Intel.

writel() has a compiler barrier in it for x86.
wmb() has a sync operation in it for x86. 

Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 21:46                               ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 21:46 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Benjamin Herrenschmidt,
	Paul E. McKenney

On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>> But that was never a requirement of writel(),
>> Documentation/memory-barriers.txt gives an explicit example demanding
>> the wmb() before writel() for ordering system memory against writel.
> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.
> 

ARM developers can get away with not including wmb() in their code and use
writel() to observe memory writes due to implicit barriers.

However, same code will not work on Intel.

writel() has a compiler barrier in it for x86.
wmb() has a sync operation in it for x86. 

Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 21:30                             ` Arnd Bergmann
@ 2018-03-26 22:00                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:00 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>  Most of the drivers have a unwound loop with writeq() or something to
> > do it.
> 
> But isn't the writeq() barrier much more expensive than anything you'd
> do in function calls?

It is for us, and will break any write combining.

> > > > The same document says that _relaxed() does not give that guarentee.
> > > > 
> > > > The lwn articule on this went into some depth on the interaction with
> > > > spinlocks.
> > > > 
> > > > As far as I can see, containment in a spinlock seems to be the only
> > > > different between writel and writel_relaxed..
> > > 
> > > I was always puzzled by this: The intention of _relaxed() on ARM
> > > (where it originates) was to skip the barrier that serializes DMA
> > > with MMIO, not to skip the serialization between MMIO and locks.
> > 
> > But that was never a requirement of writel(),
> > Documentation/memory-barriers.txt gives an explicit example demanding
> > the wmb() before writel() for ordering system memory against writel.

This is a bug in the documentation.

> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.

Linus himself made it very clear years ago. readl and writel have to
order vs memory accesses.

> > I actually have no idea why ARM had that barrier, I always assumed it
> > was to give program ordering to the accesses and that _relaxed allowed
> > re-ordering (the usual meaning of relaxed)..
> > 
> > But the barrier document makes it pretty clear that the only
> > difference between the two is spinlock containment, and WillD wrote
> > this text, so I belive it is accurate for ARM.
> > 
> > Very confusing.
> 
> It does mention serialization with both DMA and locks in the
> section about  readX_relaxed()/writeX_relaxed(). The part
> about DMA is very clear here, and I must have just forgotten
> the exact semantics with regards to spinlocks. I'm still not
> sure what prevents a writel() from leaking out the end of a
> spinlock section that doesn't happen with writel_relaxed(), since
> the barrier in writel() comes before the access, and the
> spin_unlock() shouldn't affect the external buses.

So...

Historically, what happened is that we (we means whoever participated
in the discussion on the list with Linus calling the shots really)
decided that there was no sane way for drivers to understand a world
where readl/writel didn't fully order things vs. memory accesses (ie,
DMA).

So it should always be correct to do:

	- Write to some in-memory buffer
	- writel() to kick the DMA read of that buffer

without any extra barrier.

The spinlock situation however got murky. Mostly that came up because
on architecture (I forgot who, might have been ia64) has a hard time
providing that consistency without making writel insanely expensive.

Thus they created mmiowb whose main purpose was precisely to order
writel with a following spin_unlock.

I decided not to go down that path on power because getting all drivers
"fixed" to do the right thing was going to be a losing battle, and
instead added per-cpu tracking of writel in order to "escalate" to a
heavier barrier in spin_unlock itself when necessary.

Now, all this happened more than a decade ago and it's possible that
the understanding or expectations "shifted" over time...

Cheers,
Ben.
 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:00                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:00 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>  Most of the drivers have a unwound loop with writeq() or something to
> > do it.
> 
> But isn't the writeq() barrier much more expensive than anything you'd
> do in function calls?

It is for us, and will break any write combining.

> > > > The same document says that _relaxed() does not give that guarentee.
> > > > 
> > > > The lwn articule on this went into some depth on the interaction with
> > > > spinlocks.
> > > > 
> > > > As far as I can see, containment in a spinlock seems to be the only
> > > > different between writel and writel_relaxed..
> > > 
> > > I was always puzzled by this: The intention of _relaxed() on ARM
> > > (where it originates) was to skip the barrier that serializes DMA
> > > with MMIO, not to skip the serialization between MMIO and locks.
> > 
> > But that was never a requirement of writel(),
> > Documentation/memory-barriers.txt gives an explicit example demanding
> > the wmb() before writel() for ordering system memory against writel.

This is a bug in the documentation.

> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> Adding Alexander Duyck to Cc, he added that section as part of
> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> dma_wmb()"). Also adding the other people that were involved with that.

Linus himself made it very clear years ago. readl and writel have to
order vs memory accesses.

> > I actually have no idea why ARM had that barrier, I always assumed it
> > was to give program ordering to the accesses and that _relaxed allowed
> > re-ordering (the usual meaning of relaxed)..
> > 
> > But the barrier document makes it pretty clear that the only
> > difference between the two is spinlock containment, and WillD wrote
> > this text, so I belive it is accurate for ARM.
> > 
> > Very confusing.
> 
> It does mention serialization with both DMA and locks in the
> section about  readX_relaxed()/writeX_relaxed(). The part
> about DMA is very clear here, and I must have just forgotten
> the exact semantics with regards to spinlocks. I'm still not
> sure what prevents a writel() from leaking out the end of a
> spinlock section that doesn't happen with writel_relaxed(), since
> the barrier in writel() comes before the access, and the
> spin_unlock() shouldn't affect the external buses.

So...

Historically, what happened is that we (we means whoever participated
in the discussion on the list with Linus calling the shots really)
decided that there was no sane way for drivers to understand a world
where readl/writel didn't fully order things vs. memory accesses (ie,
DMA).

So it should always be correct to do:

	- Write to some in-memory buffer
	- writel() to kick the DMA read of that buffer

without any extra barrier.

The spinlock situation however got murky. Mostly that came up because
on architecture (I forgot who, might have been ia64) has a hard time
providing that consistency without making writel insanely expensive.

Thus they created mmiowb whose main purpose was precisely to order
writel with a following spin_unlock.

I decided not to go down that path on power because getting all drivers
"fixed" to do the right thing was going to be a losing battle, and
instead added per-cpu tracking of writel in order to "escalate" to a
heavier barrier in spin_unlock itself when necessary.

Now, all this happened more than a decade ago and it's possible that
the understanding or expectations "shifted" over time...

Cheers,
Ben.
 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 21:46                               ` Sinan Kaya
@ 2018-03-26 22:01                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:01 UTC (permalink / raw)
  To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, David Laight, Oliver,
	Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > But that was never a requirement of writel(),
> > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > the wmb() before writel() for ordering system memory against writel.
> > 
> > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > Adding Alexander Duyck to Cc, he added that section as part of
> > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > dma_wmb()"). Also adding the other people that were involved with that.
> > 
> 
> ARM developers can get away with not including wmb() in their code and use
> writel() to observe memory writes due to implicit barriers.
> 
> However, same code will not work on Intel.

Wrong. It will.

You do NOT need wmb between writes to memory and writel.

> writel() has a compiler barrier in it for x86.
> wmb() has a sync operation in it for x86. 
> 
> Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

This is completely wrong. They will. Intel provides the necessary
ordering guarantees without an explicit wmb.

Otherwise almost all drivers out there are broken which I very much
doubt :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:01                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:01 UTC (permalink / raw)
  To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > But that was never a requirement of writel(),
> > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > the wmb() before writel() for ordering system memory against writel.
> > 
> > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > Adding Alexander Duyck to Cc, he added that section as part of
> > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > dma_wmb()"). Also adding the other people that were involved with that.
> > 
> 
> ARM developers can get away with not including wmb() in their code and use
> writel() to observe memory writes due to implicit barriers.
> 
> However, same code will not work on Intel.

Wrong. It will.

You do NOT need wmb between writes to memory and writel.

> writel() has a compiler barrier in it for x86.
> wmb() has a sync operation in it for x86. 
> 
> Unless wmb() is called, PCIe device won't observe memory updates from the CPU.

This is completely wrong. They will. Intel provides the necessary
ordering guarantees without an explicit wmb.

Otherwise almost all drivers out there are broken which I very much
doubt :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:01                                 ` Benjamin Herrenschmidt
@ 2018-03-26 22:08                                   ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 22:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe
  Cc: linux-rdma, Will Deacon, Alexander Duyck, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>>>> But that was never a requirement of writel(),
>>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>>> the wmb() before writel() for ordering system memory against writel.
>>>
>>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>>> Adding Alexander Duyck to Cc, he added that section as part of
>>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>>> dma_wmb()"). Also adding the other people that were involved with that.
>>>
>>
>> ARM developers can get away with not including wmb() in their code and use
>> writel() to observe memory writes due to implicit barriers.
>>
>> However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.

If writel() provides such a guarantee, why do I see code sequences like

wmb()
writel() 

all over the place.

> 
>> writel() has a compiler barrier in it for x86.
>> wmb() has a sync operation in it for x86. 
>>
>> Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 

I'm still reserving my doubts here. I was told about an explicit
wmb() requirement last week.

> Otherwise almost all drivers out there are broken which I very much
> doubt :-)
> 
> Cheers,
> Ben.
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:08                                   ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-26 22:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>> On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
>>>> But that was never a requirement of writel(),
>>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>>> the wmb() before writel() for ordering system memory against writel.
>>>
>>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>>> Adding Alexander Duyck to Cc, he added that section as part of
>>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>>> dma_wmb()"). Also adding the other people that were involved with that.
>>>
>>
>> ARM developers can get away with not including wmb() in their code and use
>> writel() to observe memory writes due to implicit barriers.
>>
>> However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.

If writel() provides such a guarantee, why do I see code sequences like

wmb()
writel() 

all over the place.

> 
>> writel() has a compiler barrier in it for x86.
>> wmb() has a sync operation in it for x86. 
>>
>> Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 

I'm still reserving my doubts here. I was told about an explicit
wmb() requirement last week.

> Otherwise almost all drivers out there are broken which I very much
> doubt :-)
> 
> Cheers,
> Ben.
> 
> 
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:01                                 ` Benjamin Herrenschmidt
@ 2018-03-26 22:27                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 22:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > But that was never a requirement of writel(),
> > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > the wmb() before writel() for ordering system memory against writel.
> > > 
> > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > Adding Alexander Duyck to Cc, he added that section as part of
> > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > dma_wmb()"). Also adding the other people that were involved with that.
> > > 
> > 
> > ARM developers can get away with not including wmb() in their code and use
> > writel() to observe memory writes due to implicit barriers.
> > 
> > However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.
> 
> > writel() has a compiler barrier in it for x86.
> > wmb() has a sync operation in it for x86. 
> > 
> > Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 
> Otherwise almost all drivers out there are broken which I very much
> doubt :-)

But.. Sinan is right, you look anywhere in the driver tree and you
find stuff like this:

drivers/net/ethernet/intel/i40e/i40e_txrx.c

        /* Force memory writes to complete before letting h/w
         * know there are new descriptors to fetch.
         */
        wmb();


It is *systemic*

I even see patches adding wmb() based on actual observed memory
corruption during testing on Intel:

https://patchwork.kernel.org/patch/10177207/

So you think all of this is unnecessary and writel is totally strongly
ordered, even on multi-socket Intel?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:27                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 22:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > But that was never a requirement of writel(),
> > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > the wmb() before writel() for ordering system memory against writel.
> > > 
> > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > Adding Alexander Duyck to Cc, he added that section as part of
> > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > dma_wmb()"). Also adding the other people that were involved with that.
> > > 
> > 
> > ARM developers can get away with not including wmb() in their code and use
> > writel() to observe memory writes due to implicit barriers.
> > 
> > However, same code will not work on Intel.
> 
> Wrong. It will.
> 
> You do NOT need wmb between writes to memory and writel.
> 
> > writel() has a compiler barrier in it for x86.
> > wmb() has a sync operation in it for x86. 
> > 
> > Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> 
> This is completely wrong. They will. Intel provides the necessary
> ordering guarantees without an explicit wmb.
> 
> Otherwise almost all drivers out there are broken which I very much
> doubt :-)

But.. Sinan is right, you look anywhere in the driver tree and you
find stuff like this:

drivers/net/ethernet/intel/i40e/i40e_txrx.c

        /* Force memory writes to complete before letting h/w
         * know there are new descriptors to fetch.
         */
        wmb();


It is *systemic*

I even see patches adding wmb() based on actual observed memory
corruption during testing on Intel:

https://patchwork.kernel.org/patch/10177207/

So you think all of this is unnecessary and writel is totally strongly
ordered, even on multi-socket Intel?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:08                                   ` Sinan Kaya
@ 2018-03-26 22:28                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:28 UTC (permalink / raw)
  To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe
  Cc: linux-rdma, Will Deacon, Alexander Duyck, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, 2018-03-26 at 18:08 -0400, Sinan Kaya wrote:
> On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > > But that was never a requirement of writel(),
> > > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > > the wmb() before writel() for ordering system memory against writel.
> > > > 
> > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > > Adding Alexander Duyck to Cc, he added that section as part of
> > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > > dma_wmb()"). Also adding the other people that were involved with that.
> > > > 
> > > 
> > > ARM developers can get away with not including wmb() in their code and use
> > > writel() to observe memory writes due to implicit barriers.
> > > 
> > > However, same code will not work on Intel.
> > 
> > Wrong. It will.
> > 
> > You do NOT need wmb between writes to memory and writel.
> 
> If writel() provides such a guarantee, why do I see code sequences like
> 
> wmb()
> writel() 
> 
> all over the place.

Because it was badly documented and people didn't know what to do, or
maybe the underlying mapping is WC ?

I don't know for sure but I can tell you Linus opinion on the matter
back in the days was very clear and that's why we implemented writel
the way we did on powerpc.

> > 
> > > writel() has a compiler barrier in it for x86.
> > > wmb() has a sync operation in it for x86. 
> > > 
> > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> > 
> > This is completely wrong. They will. Intel provides the necessary
> > ordering guarantees without an explicit wmb.
> > 
> 
> I'm still reserving my doubts here. I was told about an explicit
> wmb() requirement last week.

By whome ?

> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> > 
> > Cheers,
> > Ben.
> > 
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:28                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:28 UTC (permalink / raw)
  To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Mon, 2018-03-26 at 18:08 -0400, Sinan Kaya wrote:
> On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote:
> > > > > But that was never a requirement of writel(),
> > > > > Documentation/memory-barriers.txt gives an explicit example demanding
> > > > > the wmb() before writel() for ordering system memory against writel.
> > > > 
> > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> > > > Adding Alexander Duyck to Cc, he added that section as part of
> > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> > > > dma_wmb()"). Also adding the other people that were involved with that.
> > > > 
> > > 
> > > ARM developers can get away with not including wmb() in their code and use
> > > writel() to observe memory writes due to implicit barriers.
> > > 
> > > However, same code will not work on Intel.
> > 
> > Wrong. It will.
> > 
> > You do NOT need wmb between writes to memory and writel.
> 
> If writel() provides such a guarantee, why do I see code sequences like
> 
> wmb()
> writel() 
> 
> all over the place.

Because it was badly documented and people didn't know what to do, or
maybe the underlying mapping is WC ?

I don't know for sure but I can tell you Linus opinion on the matter
back in the days was very clear and that's why we implemented writel
the way we did on powerpc.

> > 
> > > writel() has a compiler barrier in it for x86.
> > > wmb() has a sync operation in it for x86. 
> > > 
> > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU.
> > 
> > This is completely wrong. They will. Intel provides the necessary
> > ordering guarantees without an explicit wmb.
> > 
> 
> I'm still reserving my doubts here. I was told about an explicit
> wmb() requirement last week.

By whome ?

> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> > 
> > Cheers,
> > Ben.
> > 
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:27                                   ` Jason Gunthorpe
@ 2018-03-26 22:36                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> 
> But.. Sinan is right, you look anywhere in the driver tree and you
> find stuff like this:
> 
> drivers/net/ethernet/intel/i40e/i40e_txrx.c
> 
>         /* Force memory writes to complete before letting h/w
>          * know there are new descriptors to fetch.
>          */
>         wmb();
> 
> 
> It is *systemic*

Yes, because they all copied e1000e :-) If you look at the comment in
there, it does say it's only for weakly ordered archs such as ia64, and
even then, probably predates Linus strong statement on the matter.

> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
> 
> https://patchwork.kernel.org/patch/10177207/
> 
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

I don't kow, it used to be the case, at least that's what drove us to
define things the way we did.

Maybe things changed, but if that's the case, nobody knows for sure,
and we probably want to get Linus POV on the matter.

I know I still write drivers that do not add a wmb in that case because
I expect things to work without it.

If that has changed, we probably can relax some of the barriers in our
implementations of writel on a number of architectures, but not before
auditing a bunch more drivers to make sure they have the write wmb()'s

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:36                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > Otherwise almost all drivers out there are broken which I very much
> > doubt :-)
> 
> But.. Sinan is right, you look anywhere in the driver tree and you
> find stuff like this:
> 
> drivers/net/ethernet/intel/i40e/i40e_txrx.c
> 
>         /* Force memory writes to complete before letting h/w
>          * know there are new descriptors to fetch.
>          */
>         wmb();
> 
> 
> It is *systemic*

Yes, because they all copied e1000e :-) If you look at the comment in
there, it does say it's only for weakly ordered archs such as ia64, and
even then, probably predates Linus strong statement on the matter.

> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
> 
> https://patchwork.kernel.org/patch/10177207/
> 
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

I don't kow, it used to be the case, at least that's what drove us to
define things the way we did.

Maybe things changed, but if that's the case, nobody knows for sure,
and we probably want to get Linus POV on the matter.

I know I still write drivers that do not add a wmb in that case because
I expect things to work without it.

If that has changed, we probably can relax some of the barriers in our
implementations of writel on a number of architectures, but not before
auditing a bunch more drivers to make sure they have the write wmb()'s

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:36                                     ` Benjamin Herrenschmidt
@ 2018-03-26 22:42                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, 2018-03-27 at 09:36 +1100, Benjamin Herrenschmidt wrote:
> I don't kow, it used to be the case, at least that's what drove us to
> define things the way we did.
> 
> Maybe things changed, but if that's the case, nobody knows for sure,
> and we probably want to get Linus POV on the matter.
> 
> I know I still write drivers that do not add a wmb in that case because
> I expect things to work without it.
> 
> If that has changed, we probably can relax some of the barriers in our
> implementations of writel on a number of architectures, but not before
> auditing a bunch more drivers to make sure they have the write wmb()'s

Note also that this was the entire point behind the definition of
the _relaxed() accessors, to lift that specific ordering guarantee.

If you now says that memory + writel requires a wmb() in between
then you made writel be identical to writel_relaxed.

You might notice that Documentation/driver-api/device-io.rst makes no
mention of wmb() at all.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:42                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 22:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, 2018-03-27 at 09:36 +1100, Benjamin Herrenschmidt wrote:
> I don't kow, it used to be the case, at least that's what drove us to
> define things the way we did.
> 
> Maybe things changed, but if that's the case, nobody knows for sure,
> and we probably want to get Linus POV on the matter.
> 
> I know I still write drivers that do not add a wmb in that case because
> I expect things to work without it.
> 
> If that has changed, we probably can relax some of the barriers in our
> implementations of writel on a number of architectures, but not before
> auditing a bunch more drivers to make sure they have the write wmb()'s

Note also that this was the entire point behind the definition of
the _relaxed() accessors, to lift that specific ordering guarantee.

If you now says that memory + writel requires a wmb() in between
then you made writel be identical to writel_relaxed.

You might notice that Documentation/driver-api/device-io.rst makes no
mention of wmb() at all.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:36                                     ` Benjamin Herrenschmidt
@ 2018-03-26 22:50                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 22:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > Otherwise almost all drivers out there are broken which I very much
> > > doubt :-)
> > 
> > But.. Sinan is right, you look anywhere in the driver tree and you
> > find stuff like this:
> > 
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > 
> >         /* Force memory writes to complete before letting h/w
> >          * know there are new descriptors to fetch.
> >          */
> >         wmb();
> > 
> > 
> > It is *systemic*
> 
> Yes, because they all copied e1000e :-) If you look at the comment in
> there, it does say it's only for weakly ordered archs such as ia64, and
> even then, probably predates Linus strong statement on the matter.

Hahah, sure I'll buy that..

But still, if this really is the case, a *strong* statement in
barriers.txt to that effect (and not an example demanding the wmb()!)
would be very helpful for those of us that have to review driver code!

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 22:50                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-26 22:50 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > Otherwise almost all drivers out there are broken which I very much
> > > doubt :-)
> > 
> > But.. Sinan is right, you look anywhere in the driver tree and you
> > find stuff like this:
> > 
> > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > 
> >         /* Force memory writes to complete before letting h/w
> >          * know there are new descriptors to fetch.
> >          */
> >         wmb();
> > 
> > 
> > It is *systemic*
> 
> Yes, because they all copied e1000e :-) If you look at the comment in
> there, it does say it's only for weakly ordered archs such as ia64, and
> even then, probably predates Linus strong statement on the matter.

Hahah, sure I'll buy that..

But still, if this really is the case, a *strong* statement in
barriers.txt to that effect (and not an example demanding the wmb()!)
would be very helpful for those of us that have to review driver code!

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:50                                       ` Jason Gunthorpe
@ 2018-03-26 23:59                                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > Otherwise almost all drivers out there are broken which I very much
> > > > doubt :-)
> > > 
> > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > find stuff like this:
> > > 
> > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > 
> > >         /* Force memory writes to complete before letting h/w
> > >          * know there are new descriptors to fetch.
> > >          */
> > >         wmb();
> > > 
> > > 
> > > It is *systemic*
> > 
> > Yes, because they all copied e1000e :-) If you look at the comment in
> > there, it does say it's only for weakly ordered archs such as ia64, and
> > even then, probably predates Linus strong statement on the matter.
> 
> Hahah, sure I'll buy that..
> 
> But still, if this really is the case, a *strong* statement in
> barriers.txt to that effect (and not an example demanding the wmb()!)
> would be very helpful for those of us that have to review driver code!

I agree, and that Mellanox bug you pointed me to seems to indicate that
this may not even be true on x86 anymore ...

I think we might need to revisit this properly...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-26 23:59                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-26 23:59 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > Otherwise almost all drivers out there are broken which I very much
> > > > doubt :-)
> > > 
> > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > find stuff like this:
> > > 
> > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > 
> > >         /* Force memory writes to complete before letting h/w
> > >          * know there are new descriptors to fetch.
> > >          */
> > >         wmb();
> > > 
> > > 
> > > It is *systemic*
> > 
> > Yes, because they all copied e1000e :-) If you look at the comment in
> > there, it does say it's only for weakly ordered archs such as ia64, and
> > even then, probably predates Linus strong statement on the matter.
> 
> Hahah, sure I'll buy that..
> 
> But still, if this really is the case, a *strong* statement in
> barriers.txt to that effect (and not an example demanding the wmb()!)
> would be very helpful for those of us that have to review driver code!

I agree, and that Mellanox bug you pointed me to seems to indicate that
this may not even be true on x86 anymore ...

I think we might need to revisit this properly...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 23:59                                         ` Benjamin Herrenschmidt
@ 2018-03-27  1:39                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27  1:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Sinan Kaya, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > > Otherwise almost all drivers out there are broken which I very much
> > > > > doubt :-)
> > > > 
> > > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > > find stuff like this:
> > > > 
> > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > > 
> > > >         /* Force memory writes to complete before letting h/w
> > > >          * know there are new descriptors to fetch.
> > > >          */
> > > >         wmb();
> > > > 
> > > > 
> > > > It is *systemic*
> > > 
> > > Yes, because they all copied e1000e :-) If you look at the comment in
> > > there, it does say it's only for weakly ordered archs such as ia64, and
> > > even then, probably predates Linus strong statement on the matter.
> > 
> > Hahah, sure I'll buy that..
> > 
> > But still, if this really is the case, a *strong* statement in
> > barriers.txt to that effect (and not an example demanding the wmb()!)
> > would be very helpful for those of us that have to review driver code!
> 
> I agree, and that Mellanox bug you pointed me to seems to indicate that
> this may not even be true on x86 anymore ...

However, with bugs like that it is hard to know what is going on.. It
could be a CPU bug instead.

> I think we might need to revisit this properly...

I would love to hear a definitive statement from Intel on what wmb();
writel(); does on x86..

Sinan's patches are backwards if writel is ordered, instead of
using writel_relaxed, they should be eliminating the wmb().

But there is no way patches like that could go ahead until
barriers.txt is updated..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  1:39                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27  1:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote:
> > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote:
> > > > > Otherwise almost all drivers out there are broken which I very much
> > > > > doubt :-)
> > > > 
> > > > But.. Sinan is right, you look anywhere in the driver tree and you
> > > > find stuff like this:
> > > > 
> > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c
> > > > 
> > > >         /* Force memory writes to complete before letting h/w
> > > >          * know there are new descriptors to fetch.
> > > >          */
> > > >         wmb();
> > > > 
> > > > 
> > > > It is *systemic*
> > > 
> > > Yes, because they all copied e1000e :-) If you look at the comment in
> > > there, it does say it's only for weakly ordered archs such as ia64, and
> > > even then, probably predates Linus strong statement on the matter.
> > 
> > Hahah, sure I'll buy that..
> > 
> > But still, if this really is the case, a *strong* statement in
> > barriers.txt to that effect (and not an example demanding the wmb()!)
> > would be very helpful for those of us that have to review driver code!
> 
> I agree, and that Mellanox bug you pointed me to seems to indicate that
> this may not even be true on x86 anymore ...

However, with bugs like that it is hard to know what is going on.. It
could be a CPU bug instead.

> I think we might need to revisit this properly...

I would love to hear a definitive statement from Intel on what wmb();
writel(); does on x86..

Sinan's patches are backwards if writel is ordered, instead of
using writel_relaxed, they should be eliminating the wmb().

But there is no way patches like that could go ahead until
barriers.txt is updated..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:27                                   ` Jason Gunthorpe
@ 2018-03-27  7:56                                     ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27  7:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>
> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
>
> https://patchwork.kernel.org/patch/10177207/
>
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

This example adds a wmb() between two writes to a coherent DMA
area, it is definitely required there. I'm pretty sure I've never seen
any bug reports pointing to a missing wmb() between memory
and MMIO write accesses, but if you remember seeing them in the
list, maybe you can look again for some evidence of something going
wrong on x86 without it?

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  7:56                                     ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27  7:56 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
>> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
>
> I even see patches adding wmb() based on actual observed memory
> corruption during testing on Intel:
>
> https://patchwork.kernel.org/patch/10177207/
>
> So you think all of this is unnecessary and writel is totally strongly
> ordered, even on multi-socket Intel?

This example adds a wmb() between two writes to a coherent DMA
area, it is definitely required there. I'm pretty sure I've never seen
any bug reports pointing to a missing wmb() between memory
and MMIO write accesses, but if you remember seeing them in the
list, maybe you can look again for some evidence of something going
wrong on x86 without it?

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  7:56                                     ` Arnd Bergmann
@ 2018-03-27  8:56                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27  8:56 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > 
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> > 
> > https://patchwork.kernel.org/patch/10177207/
> > 
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. 

Ah you are right, I incorrectly assumed that the "prod_db" function was
an MMIO. So we do NOT have a counter example where wmb is needed on
x86, pfiew ! :-)

> I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

The interesting thing is that we do seem to have a whole LOT of these
spurrious wmb before writel all over the tree, I suspect because of
that incorrect recommendation in memory-barriers.txt.

We should fix that.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  8:56                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27  8:56 UTC (permalink / raw)
  To: Arnd Bergmann, Jason Gunthorpe
  Cc: Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > 
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> > 
> > https://patchwork.kernel.org/patch/10177207/
> > 
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. 

Ah you are right, I incorrectly assumed that the "prod_db" function was
an MMIO. So we do NOT have a counter example where wmb is needed on
x86, pfiew ! :-)

> I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

The interesting thing is that we do seem to have a whole LOT of these
spurrious wmb before writel all over the tree, I suspect because of
that incorrect recommendation in memory-barriers.txt.

We should fix that.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  7:56                                     ` Arnd Bergmann
@ 2018-03-27  9:42                                       ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27  9:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

If this is just about ordering accesses to coherent DMA, then using
dma_wmb() instead will be much better performance on arm/arm64.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  9:42                                       ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27  9:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason Gunthorpe, Benjamin Herrenschmidt, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney

On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there. I'm pretty sure I've never seen
> any bug reports pointing to a missing wmb() between memory
> and MMIO write accesses, but if you remember seeing them in the
> list, maybe you can look again for some evidence of something going
> wrong on x86 without it?

If this is just about ordering accesses to coherent DMA, then using
dma_wmb() instead will be much better performance on arm/arm64.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  8:56                                       ` Benjamin Herrenschmidt
@ 2018-03-27  9:44                                         ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27  9:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> I'm pretty sure I've never seen
>> any bug reports pointing to a missing wmb() between memory
>> and MMIO write accesses, but if you remember seeing them in the
>> list, maybe you can look again for some evidence of something going
>> wrong on x86 without it?
>
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
>
> We should fix that.

Maybe the problem is just that it's so counter-intuitive that we don't
need that barrier in Linux, when the hardware does need one on some
architectures.

How about we define a barrier type instruction specifically for this
purpose, something like wmb_before_mmio() and have all architectures
define that to an empty macro?

That way, having correct code using wmb_before_mmio() will not
trigger an incorrect review comment that leads to extra wmb(). ;-)

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  9:44                                         ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27  9:44 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>>
>> I'm pretty sure I've never seen
>> any bug reports pointing to a missing wmb() between memory
>> and MMIO write accesses, but if you remember seeing them in the
>> list, maybe you can look again for some evidence of something going
>> wrong on x86 without it?
>
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
>
> We should fix that.

Maybe the problem is just that it's so counter-intuitive that we don't
need that barrier in Linux, when the hardware does need one on some
architectures.

How about we define a barrier type instruction specifically for this
purpose, something like wmb_before_mmio() and have all architectures
define that to an empty macro?

That way, having correct code using wmb_before_mmio() will not
trigger an incorrect review comment that leads to extra wmb(). ;-)

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  8:56                                       ` Benjamin Herrenschmidt
@ 2018-03-27  9:57                                         ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27  9:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, peterz, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	mingo

[+ locking/ordering/docs people]

On Tue, Mar 27, 2018 at 07:56:59PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > 
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > > 
> > > https://patchwork.kernel.org/patch/10177207/
> > > 
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. 
> 
> Ah you are right, I incorrectly assumed that the "prod_db" function was
> an MMIO. So we do NOT have a counter example where wmb is needed on
> x86, pfiew ! :-)
> 
> > I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
> 
> We should fix that.

Patch below. Thoughts?

Will

--->8

>From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 27 Mar 2018 10:49:58 +0100
Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
 example

The section of memory-barriers.txt that describes the dma_Xmb() barriers
has an incorrect example claiming that a wmb() is required after writing
to coherent memory in order for those writes to be visible to a device
before a subsequent MMIO access using writel() can reach the device.

In fact, this ordering guarantee is provided (at significant cost on some
architectures such as arm and power) by writel, so the wmb() is not
necessary. writel_relaxed exists for cases where this ordering is not
required.

Fix the example and update the text to make this clearer.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Reported-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/memory-barriers.txt | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..2556b4b0e6f9 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership.  The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership.  Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the cache incoherent MMIO region.
+     If this ordering between incoherent MMIO and coherent memory regions
+     is not required, writel_relaxed() can be used instead and is significantly
+     cheaper on some weakly-ordered architectures.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27  9:57                                         ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27  9:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, peterz, mingo,
	corbet

[+ locking/ordering/docs people]

On Tue, Mar 27, 2018 at 07:56:59PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > > 
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > > 
> > > https://patchwork.kernel.org/patch/10177207/
> > > 
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. 
> 
> Ah you are right, I incorrectly assumed that the "prod_db" function was
> an MMIO. So we do NOT have a counter example where wmb is needed on
> x86, pfiew ! :-)
> 
> > I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> The interesting thing is that we do seem to have a whole LOT of these
> spurrious wmb before writel all over the tree, I suspect because of
> that incorrect recommendation in memory-barriers.txt.
> 
> We should fix that.

Patch below. Thoughts?

Will

--->8

>From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 27 Mar 2018 10:49:58 +0100
Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
 example

The section of memory-barriers.txt that describes the dma_Xmb() barriers
has an incorrect example claiming that a wmb() is required after writing
to coherent memory in order for those writes to be visible to a device
before a subsequent MMIO access using writel() can reach the device.

In fact, this ordering guarantee is provided (at significant cost on some
architectures such as arm and power) by writel, so the wmb() is not
necessary. writel_relaxed exists for cases where this ordering is not
required.

Fix the example and update the text to make this clearer.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Reported-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/memory-barriers.txt | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..2556b4b0e6f9 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership.  The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership.  Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the cache incoherent MMIO region.
+     If this ordering between incoherent MMIO and coherent memory regions
+     is not required, writel_relaxed() can be used instead and is significantly
+     cheaper on some weakly-ordered architectures.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:44                                         ` Arnd Bergmann
@ 2018-03-27 10:00                                           ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 10:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 11:44:22AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>
> >> I'm pretty sure I've never seen
> >> any bug reports pointing to a missing wmb() between memory
> >> and MMIO write accesses, but if you remember seeing them in the
> >> list, maybe you can look again for some evidence of something going
> >> wrong on x86 without it?
> >
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> >
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?
> 
> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Please don't add more barriers :)! I think that will make it even more
difficult to understand how to use the ones we already have -- the problem
here seems to be that the documentation that was added for the dma_*
barriers got this wrong, but it was at least in contradiction with the
section elsewhere in memory-barriers.txt that describes the relaxed I/O
accessors.

I guess somebody could hack checkpatch to look for back-to-back wmb/writel
sequences? I suspect we could do something with coccinelle too.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 10:00                                           ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 10:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney

On Tue, Mar 27, 2018 at 11:44:22AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote:
> >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >>
> >> I'm pretty sure I've never seen
> >> any bug reports pointing to a missing wmb() between memory
> >> and MMIO write accesses, but if you remember seeing them in the
> >> list, maybe you can look again for some evidence of something going
> >> wrong on x86 without it?
> >
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> >
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?
> 
> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Please don't add more barriers :)! I think that will make it even more
difficult to understand how to use the ones we already have -- the problem
here seems to be that the documentation that was added for the dma_*
barriers got this wrong, but it was at least in contradiction with the
section elsewhere in memory-barriers.txt that describes the relaxed I/O
accessors.

I guess somebody could hack checkpatch to look for back-to-back wmb/writel
sequences? I suspect we could do something with coccinelle too.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:57                                         ` Will Deacon
@ 2018-03-27 10:05                                           ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 10:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote:

>
> From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Tue, 27 Mar 2018 10:49:58 +0100
> Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
>  example
>
> The section of memory-barriers.txt that describes the dma_Xmb() barriers
> has an incorrect example claiming that a wmb() is required after writing
> to coherent memory in order for those writes to be visible to a device
> before a subsequent MMIO access using writel() can reach the device.
>
> In fact, this ordering guarantee is provided (at significant cost on some
> architectures such as arm and power) by writel, so the wmb() is not
> necessary. writel_relaxed exists for cases where this ordering is not
> required.
>
> Fix the example and update the text to make this clearer.
>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Reported-by: Sinan Kaya <okaya@codeaurora.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  Documentation/memory-barriers.txt | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..2556b4b0e6f9 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the cache incoherent MMIO region.
> +     If this ordering between incoherent MMIO and coherent memory regions
> +     is not required, writel_relaxed() can be used instead and is significantly
> +     cheaper on some weakly-ordered architectures.

I think that's a great improvement, but I'm a bit worried about recommending
writel_relaxed() too much: I've seen a lot of drivers that just always use
writel_relaxed() over write(), and some of them get that wrong when they
don't understand the difference but end up using DMA without explicit
barriers anyway.

Also, having an architecture-independent driver use wmb()+writel_relaxed()
ends up being more expensive than just using write(). Not sure how to
best phrase it though.

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 10:05                                           ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 10:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote:

>
> From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Tue, 27 Mar 2018 10:49:58 +0100
> Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
>  example
>
> The section of memory-barriers.txt that describes the dma_Xmb() barriers
> has an incorrect example claiming that a wmb() is required after writing
> to coherent memory in order for those writes to be visible to a device
> before a subsequent MMIO access using writel() can reach the device.
>
> In fact, this ordering guarantee is provided (at significant cost on some
> architectures such as arm and power) by writel, so the wmb() is not
> necessary. writel_relaxed exists for cases where this ordering is not
> required.
>
> Fix the example and update the text to make this clearer.
>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Jason Gunthorpe <jgg@ziepe.ca>
> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Reported-by: Sinan Kaya <okaya@codeaurora.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  Documentation/memory-barriers.txt | 18 ++++++++++--------
>  1 file changed, 10 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..2556b4b0e6f9 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the cache incoherent MMIO region.
> +     If this ordering between incoherent MMIO and coherent memory regions
> +     is not required, writel_relaxed() can be used instead and is significantly
> +     cheaper on some weakly-ordered architectures.

I think that's a great improvement, but I'm a bit worried about recommending
writel_relaxed() too much: I've seen a lot of drivers that just always use
writel_relaxed() over write(), and some of them get that wrong when they
don't understand the difference but end up using DMA without explicit
barriers anyway.

Also, having an architecture-independent driver use wmb()+writel_relaxed()
ends up being more expensive than just using write(). Not sure how to
best phrase it though.

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 10:05                                           ` Arnd Bergmann
@ 2018-03-27 10:09                                             ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 10:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote:
> 
> >
> > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> > From: Will Deacon <will.deacon@arm.com>
> > Date: Tue, 27 Mar 2018 10:49:58 +0100
> > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
> >  example
> >
> > The section of memory-barriers.txt that describes the dma_Xmb() barriers
> > has an incorrect example claiming that a wmb() is required after writing
> > to coherent memory in order for those writes to be visible to a device
> > before a subsequent MMIO access using writel() can reach the device.
> >
> > In fact, this ordering guarantee is provided (at significant cost on some
> > architectures such as arm and power) by writel, so the wmb() is not
> > necessary. writel_relaxed exists for cases where this ordering is not
> > required.
> >
> > Fix the example and update the text to make this clearer.
> >
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Reported-by: Sinan Kaya <okaya@codeaurora.org>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  Documentation/memory-barriers.txt | 18 ++++++++++--------
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index a863009849a3..2556b4b0e6f9 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> >                 /* assign ownership */
> >                 desc->status = DEVICE_OWN;
> >
> > -               /* force memory to sync before notifying device via MMIO */
> > -               wmb();
> > -
> >                 /* notify device of new descriptors */
> >                 writel(DESC_NOTIFY, doorbell);
> >         }
> > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
> >       The dma_rmb() allows us guarantee the device has released ownership
> >       before we read the data from the descriptor, and the dma_wmb() allows
> >       us to guarantee the data is written to the descriptor before the device
> > -     can see it now has ownership.  The wmb() is needed to guarantee that the
> > -     cache coherent memory writes have completed before attempting a write to
> > -     the cache incoherent MMIO region.
> > -
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the cache incoherent MMIO region.
> > +     If this ordering between incoherent MMIO and coherent memory regions
> > +     is not required, writel_relaxed() can be used instead and is significantly
> > +     cheaper on some weakly-ordered architectures.
> 
> I think that's a great improvement, but I'm a bit worried about recommending
> writel_relaxed() too much: I've seen a lot of drivers that just always use
> writel_relaxed() over write(), and some of them get that wrong when they
> don't understand the difference but end up using DMA without explicit
> barriers anyway.
> 
> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> ends up being more expensive than just using write(). Not sure how to
> best phrase it though.

Perhaps I add reword that with a simple example to say:

  If this ordering between incoherent MMIO and coherent memory regions
  is not required (e.g. in a sequence of accesses all to the MMIO region)
  [...]

since that seems to be the usual case where the _relaxed accessors help.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 10:09                                             ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 10:09 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote:
> 
> >
> > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
> > From: Will Deacon <will.deacon@arm.com>
> > Date: Tue, 27 Mar 2018 10:49:58 +0100
> > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering
> >  example
> >
> > The section of memory-barriers.txt that describes the dma_Xmb() barriers
> > has an incorrect example claiming that a wmb() is required after writing
> > to coherent memory in order for those writes to be visible to a device
> > before a subsequent MMIO access using writel() can reach the device.
> >
> > In fact, this ordering guarantee is provided (at significant cost on some
> > architectures such as arm and power) by writel, so the wmb() is not
> > necessary. writel_relaxed exists for cases where this ordering is not
> > required.
> >
> > Fix the example and update the text to make this clearer.
> >
> > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Cc: Arnd Bergmann <arnd@arndb.de>
> > Cc: Jason Gunthorpe <jgg@ziepe.ca>
> > Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Jonathan Corbet <corbet@lwn.net>
> > Reported-by: Sinan Kaya <okaya@codeaurora.org>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  Documentation/memory-barriers.txt | 18 ++++++++++--------
> >  1 file changed, 10 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> > index a863009849a3..2556b4b0e6f9 100644
> > --- a/Documentation/memory-barriers.txt
> > +++ b/Documentation/memory-barriers.txt
> > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
> >                 /* assign ownership */
> >                 desc->status = DEVICE_OWN;
> >
> > -               /* force memory to sync before notifying device via MMIO */
> > -               wmb();
> > -
> >                 /* notify device of new descriptors */
> >                 writel(DESC_NOTIFY, doorbell);
> >         }
> > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
> >       The dma_rmb() allows us guarantee the device has released ownership
> >       before we read the data from the descriptor, and the dma_wmb() allows
> >       us to guarantee the data is written to the descriptor before the device
> > -     can see it now has ownership.  The wmb() is needed to guarantee that the
> > -     cache coherent memory writes have completed before attempting a write to
> > -     the cache incoherent MMIO region.
> > -
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the cache incoherent MMIO region.
> > +     If this ordering between incoherent MMIO and coherent memory regions
> > +     is not required, writel_relaxed() can be used instead and is significantly
> > +     cheaper on some weakly-ordered architectures.
> 
> I think that's a great improvement, but I'm a bit worried about recommending
> writel_relaxed() too much: I've seen a lot of drivers that just always use
> writel_relaxed() over write(), and some of them get that wrong when they
> don't understand the difference but end up using DMA without explicit
> barriers anyway.
> 
> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> ends up being more expensive than just using write(). Not sure how to
> best phrase it though.

Perhaps I add reword that with a simple example to say:

  If this ordering between incoherent MMIO and coherent memory regions
  is not required (e.g. in a sequence of accesses all to the MMIO region)
  [...]

since that seems to be the usual case where the _relaxed accessors help.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 10:09                                             ` Will Deacon
@ 2018-03-27 10:53                                               ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 10:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
>> > -
>> > -     See Documentation/DMA-API.txt for more information on consistent memory.
>> > +     can see it now has ownership.  Note that, when using writel(), a prior
>> > +     wmb() is not needed to guarantee that the cache coherent memory writes
>> > +     have completed before writing to the cache incoherent MMIO region.
>> > +     If this ordering between incoherent MMIO and coherent memory regions

One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
prefer just "MMIO" here. At least I don't have the faintest clue what the
difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

>> > +     is not required, writel_relaxed() can be used instead and is significantly
>> > +     cheaper on some weakly-ordered architectures.
>>
>> I think that's a great improvement, but I'm a bit worried about recommending
>> writel_relaxed() too much: I've seen a lot of drivers that just always use
>> writel_relaxed() over write(), and some of them get that wrong when they
>> don't understand the difference but end up using DMA without explicit
>> barriers anyway.
>>
>> Also, having an architecture-independent driver use wmb()+writel_relaxed()
>> ends up being more expensive than just using write(). Not sure how to
>> best phrase it though.
>
> Perhaps I add reword that with a simple example to say:
>
>   If this ordering between incoherent MMIO and coherent memory regions
>   is not required (e.g. in a sequence of accesses all to the MMIO region)
>   [...]
>
> since that seems to be the usual case where the _relaxed accessors help.

That still doesn't quite capture what I'd like driver writes to do: in essence
I would recommend them to use writel() all the time, except in performance
critical code that has been shown to be correct and has a comment to explain
why _relaxed() is ok in that particular function.

Maybe it can just be rephrased to warn against the use of writel_relaxed()
here, and explain the difference that way:

     can see it now has ownership.  Note that, when using writel(), a prior
     wmb() is not needed to guarantee that the cache coherent memory writes
     have completed before writing to the cache incoherent MMIO region.
     The cheaper writel_relaxed() does not guarantee the DMA to be visible
     to the device and must not be used here.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 10:53                                               ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 10:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
>> > -
>> > -     See Documentation/DMA-API.txt for more information on consistent memory.
>> > +     can see it now has ownership.  Note that, when using writel(), a prior
>> > +     wmb() is not needed to guarantee that the cache coherent memory writes
>> > +     have completed before writing to the cache incoherent MMIO region.
>> > +     If this ordering between incoherent MMIO and coherent memory regions

One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
prefer just "MMIO" here. At least I don't have the faintest clue what the
difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

>> > +     is not required, writel_relaxed() can be used instead and is significantly
>> > +     cheaper on some weakly-ordered architectures.
>>
>> I think that's a great improvement, but I'm a bit worried about recommending
>> writel_relaxed() too much: I've seen a lot of drivers that just always use
>> writel_relaxed() over write(), and some of them get that wrong when they
>> don't understand the difference but end up using DMA without explicit
>> barriers anyway.
>>
>> Also, having an architecture-independent driver use wmb()+writel_relaxed()
>> ends up being more expensive than just using write(). Not sure how to
>> best phrase it though.
>
> Perhaps I add reword that with a simple example to say:
>
>   If this ordering between incoherent MMIO and coherent memory regions
>   is not required (e.g. in a sequence of accesses all to the MMIO region)
>   [...]
>
> since that seems to be the usual case where the _relaxed accessors help.

That still doesn't quite capture what I'd like driver writes to do: in essence
I would recommend them to use writel() all the time, except in performance
critical code that has been shown to be correct and has a comment to explain
why _relaxed() is ok in that particular function.

Maybe it can just be rephrased to warn against the use of writel_relaxed()
here, and explain the difference that way:

     can see it now has ownership.  Note that, when using writel(), a prior
     wmb() is not needed to guarantee that the cache coherent memory writes
     have completed before writing to the cache incoherent MMIO region.
     The cheaper writel_relaxed() does not guarantee the DMA to be visible
     to the device and must not be used here.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 10:53                                               ` Arnd Bergmann
@ 2018-03-27 11:02                                                 ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 11:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> >> > -
> >> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> >> > +     can see it now has ownership.  Note that, when using writel(), a prior
> >> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> >> > +     have completed before writing to the cache incoherent MMIO region.
> >> > +     If this ordering between incoherent MMIO and coherent memory regions
> 
> One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
> prefer just "MMIO" here. At least I don't have the faintest clue what the
> difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

Yes, you're right. I was just following the terminology that's already used
here, but actually that seems not be used anywhere else in the document!
I'll kill it.

> >> > +     is not required, writel_relaxed() can be used instead and is significantly
> >> > +     cheaper on some weakly-ordered architectures.
> >>
> >> I think that's a great improvement, but I'm a bit worried about recommending
> >> writel_relaxed() too much: I've seen a lot of drivers that just always use
> >> writel_relaxed() over write(), and some of them get that wrong when they
> >> don't understand the difference but end up using DMA without explicit
> >> barriers anyway.
> >>
> >> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> >> ends up being more expensive than just using write(). Not sure how to
> >> best phrase it though.
> >
> > Perhaps I add reword that with a simple example to say:
> >
> >   If this ordering between incoherent MMIO and coherent memory regions
> >   is not required (e.g. in a sequence of accesses all to the MMIO region)
> >   [...]
> >
> > since that seems to be the usual case where the _relaxed accessors help.
> 
> That still doesn't quite capture what I'd like driver writes to do: in essence
> I would recommend them to use writel() all the time, except in performance
> critical code that has been shown to be correct and has a comment to explain
> why _relaxed() is ok in that particular function.
> 
> Maybe it can just be rephrased to warn against the use of writel_relaxed()
> here, and explain the difference that way:
> 
>      can see it now has ownership.  Note that, when using writel(), a prior
>      wmb() is not needed to guarantee that the cache coherent memory writes
>      have completed before writing to the cache incoherent MMIO region.
>      The cheaper writel_relaxed() does not guarantee the DMA to be visible
>      to the device and must not be used here.

Fair enough. I'd rather people used _relaxed by default, but I have to admit
that it will probably just result in them getting things wrong. Just a tiny
bit of wordsmithing brings this to:


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..3247547d1c36 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership.  The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership.  Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the MMIO region.  The cheaper
+     writel_relaxed() does not provide this guarantee and must not be used
+     here.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER


If you're happy with that, I'll send it as a proper patch.

Cheers,

Will

^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:02                                                 ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 11:02 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:
> >> > -
> >> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> >> > +     can see it now has ownership.  Note that, when using writel(), a prior
> >> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> >> > +     have completed before writing to the cache incoherent MMIO region.
> >> > +     If this ordering between incoherent MMIO and coherent memory regions
> 
> One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd
> prefer just "MMIO" here. At least I don't have the faintest clue what the
> difference between "coherent MMIO" and "incoherent MMIO" would be ;-)

Yes, you're right. I was just following the terminology that's already used
here, but actually that seems not be used anywhere else in the document!
I'll kill it.

> >> > +     is not required, writel_relaxed() can be used instead and is significantly
> >> > +     cheaper on some weakly-ordered architectures.
> >>
> >> I think that's a great improvement, but I'm a bit worried about recommending
> >> writel_relaxed() too much: I've seen a lot of drivers that just always use
> >> writel_relaxed() over write(), and some of them get that wrong when they
> >> don't understand the difference but end up using DMA without explicit
> >> barriers anyway.
> >>
> >> Also, having an architecture-independent driver use wmb()+writel_relaxed()
> >> ends up being more expensive than just using write(). Not sure how to
> >> best phrase it though.
> >
> > Perhaps I add reword that with a simple example to say:
> >
> >   If this ordering between incoherent MMIO and coherent memory regions
> >   is not required (e.g. in a sequence of accesses all to the MMIO region)
> >   [...]
> >
> > since that seems to be the usual case where the _relaxed accessors help.
> 
> That still doesn't quite capture what I'd like driver writes to do: in essence
> I would recommend them to use writel() all the time, except in performance
> critical code that has been shown to be correct and has a comment to explain
> why _relaxed() is ok in that particular function.
> 
> Maybe it can just be rephrased to warn against the use of writel_relaxed()
> here, and explain the difference that way:
> 
>      can see it now has ownership.  Note that, when using writel(), a prior
>      wmb() is not needed to guarantee that the cache coherent memory writes
>      have completed before writing to the cache incoherent MMIO region.
>      The cheaper writel_relaxed() does not guarantee the DMA to be visible
>      to the device and must not be used here.

Fair enough. I'd rather people used _relaxed by default, but I have to admit
that it will probably just result in them getting things wrong. Just a tiny
bit of wordsmithing brings this to:


diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..3247547d1c36 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership.  The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership.  Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the MMIO region.  The cheaper
+     writel_relaxed() does not provide this guarantee and must not be used
+     here.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER


If you're happy with that, I'll send it as a proper patch.

Cheers,

Will

^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 11:02                                                 ` Will Deacon
@ 2018-03-27 11:05                                                   ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 11:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
>> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:

> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>
>
>  MMIO WRITE BARRIER
>
>
> If you're happy with that, I'll send it as a proper patch.

Looks good to me, thanks!

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:05                                                   ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 11:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote:
>> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote:
>> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote:

> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>
>
>  MMIO WRITE BARRIER
>
>
> If you're happy with that, I'll send it as a proper patch.

Looks good to me, thanks!

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:42                                       ` Will Deacon
@ 2018-03-27 11:20                                         ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:20 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

Ah, something we should look into for powerpc as well, as we could use
an lwsync for that which is also cheaper than a full sync wmb does.

dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
conditional right ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:20                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:20 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney

On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

Ah, something we should look into for powerpc as well, as we could use
an lwsync for that which is also cheaper than a full sync wmb does.

dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
conditional right ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:57                                         ` Will Deacon
@ 2018-03-27 11:21                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Arnd Bergmann, corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, peterz, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	mingo

On Tue, 2018-03-27 at 10:57 +0100, Will Deacon wrote:
> > 
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Patch below. Thoughts?

Looks good, we should probably also have a clearer (explicit)
definition of that ordering in the driver-api doco.

Now, to remove all those useless wmb's and find what other bugs they
were papering over ... ;-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:21                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:21 UTC (permalink / raw)
  To: Will Deacon
  Cc: Arnd Bergmann, Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, peterz, mingo,
	corbet

On Tue, 2018-03-27 at 10:57 +0100, Will Deacon wrote:
> > 
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Patch below. Thoughts?

Looks good, we should probably also have a clearer (explicit)
definition of that ordering in the driver-api doco.

Now, to remove all those useless wmb's and find what other bugs they
were papering over ... ;-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:44                                         ` Arnd Bergmann
@ 2018-03-27 11:23                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?

This is exactly what wmb() is about and exactly what Linux rejected
back in the day (and in hindsight I agree with him).

> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Ah, you mean have an empty macro that will always be empty on all
architectures just to fool people ? :-)

Not sure that will fly ... I think we just need to be documenting that
stuff better and not have incorrect examples. Also a sweep to remove
some useless ones like the one in e1000e would help.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:23                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:23 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> > The interesting thing is that we do seem to have a whole LOT of these
> > spurrious wmb before writel all over the tree, I suspect because of
> > that incorrect recommendation in memory-barriers.txt.
> > 
> > We should fix that.
> 
> Maybe the problem is just that it's so counter-intuitive that we don't
> need that barrier in Linux, when the hardware does need one on some
> architectures.
> 
> How about we define a barrier type instruction specifically for this
> purpose, something like wmb_before_mmio() and have all architectures
> define that to an empty macro?

This is exactly what wmb() is about and exactly what Linux rejected
back in the day (and in hindsight I agree with him).

> That way, having correct code using wmb_before_mmio() will not
> trigger an incorrect review comment that leads to extra wmb(). ;-)

Ah, you mean have an empty macro that will always be empty on all
architectures just to fool people ? :-)

Not sure that will fly ... I think we just need to be documenting that
stuff better and not have incorrect examples. Also a sweep to remove
some useless ones like the one in e1000e would help.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 11:20                                         ` Benjamin Herrenschmidt
@ 2018-03-27 11:24                                           ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 11:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 10:20:02PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > > 
> > > This example adds a wmb() between two writes to a coherent DMA
> > > area, it is definitely required there. I'm pretty sure I've never seen
> > > any bug reports pointing to a missing wmb() between memory
> > > and MMIO write accesses, but if you remember seeing them in the
> > > list, maybe you can look again for some evidence of something going
> > > wrong on x86 without it?
> > 
> > If this is just about ordering accesses to coherent DMA, then using
> > dma_wmb() instead will be much better performance on arm/arm64.
> 
> Ah, something we should look into for powerpc as well, as we could use
> an lwsync for that which is also cheaper than a full sync wmb does.
> 
> dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
> conditional right ?

Almost -- the slight change we have on arm64 is to say that it's
"outer-shareable", which means it also orders non-cacheable accesses
in the case that dma_alloc_coherent is used to allocate a consistent
buffer for a non-coherent device.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:24                                           ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 11:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney

On Tue, Mar 27, 2018 at 10:20:02PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote:
> > > 
> > > This example adds a wmb() between two writes to a coherent DMA
> > > area, it is definitely required there. I'm pretty sure I've never seen
> > > any bug reports pointing to a missing wmb() between memory
> > > and MMIO write accesses, but if you remember seeing them in the
> > > list, maybe you can look again for some evidence of something going
> > > wrong on x86 without it?
> > 
> > If this is just about ordering accesses to coherent DMA, then using
> > dma_wmb() instead will be much better performance on arm/arm64.
> 
> Ah, something we should look into for powerpc as well, as we could use
> an lwsync for that which is also cheaper than a full sync wmb does.
> 
> dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP
> conditional right ?

Almost -- the slight change we have on arm64 is to say that it's
"outer-shareable", which means it also orders non-cacheable accesses
in the case that dma_alloc_coherent is used to allocate a consistent
buffer for a non-coherent device.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 11:02                                                 ` Will Deacon
@ 2018-03-27 11:25                                                   ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:25 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote:
>       can see it now has ownership.  Note that, when using writel(), a prior
> >       wmb() is not needed to guarantee that the cache coherent memory writes
> >       have completed before writing to the cache incoherent MMIO region.
> >       The cheaper writel_relaxed() does not guarantee the DMA to be visible
> >       to the device and must not be used here.
> 
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong. Just a tiny
> bit of wordsmithing brings this to:

I prefer people using writel() by default for the simple reason that
99% of writels out there are configuration stuff for which the
performance difference doesn't matter, and people will just get it
wrong.

Let's focus on the rare fast path for optimisation.
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>  
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>  
>  
>  MMIO WRITE BARRIER
> 
> 
> If you're happy with that, I'll send it as a proper patch.
> 
> Cheers,
> 
> Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 11:25                                                   ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 11:25 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Jason Gunthorpe, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote:
>       can see it now has ownership.  Note that, when using writel(), a prior
> >       wmb() is not needed to guarantee that the cache coherent memory writes
> >       have completed before writing to the cache incoherent MMIO region.
> >       The cheaper writel_relaxed() does not guarantee the DMA to be visible
> >       to the device and must not be used here.
> 
> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong. Just a tiny
> bit of wordsmithing brings this to:

I prefer people using writel() by default for the simple reason that
99% of writels out there are configuration stuff for which the
performance difference doesn't matter, and people will just get it
wrong.

Let's focus on the rare fast path for optimisation.
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index a863009849a3..3247547d1c36 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
>                 /* assign ownership */
>                 desc->status = DEVICE_OWN;
>  
> -               /* force memory to sync before notifying device via MMIO */
> -               wmb();
> -
>                 /* notify device of new descriptors */
>                 writel(DESC_NOTIFY, doorbell);
>         }
> @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
>       The dma_rmb() allows us guarantee the device has released ownership
>       before we read the data from the descriptor, and the dma_wmb() allows
>       us to guarantee the data is written to the descriptor before the device
> -     can see it now has ownership.  The wmb() is needed to guarantee that the
> -     cache coherent memory writes have completed before attempting a write to
> -     the cache incoherent MMIO region.
> -
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.
> +
> +     See the subsection "Kernel I/O barrier effects" for more information on
> +     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
> +     information on consistent memory.
>  
>  
>  MMIO WRITE BARRIER
> 
> 
> If you're happy with that, I'll send it as a proper patch.
> 
> Cheers,
> 
> Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 11:23                                           ` Benjamin Herrenschmidt
@ 2018-03-27 12:22                                             ` okaya
  -1 siblings, 0 replies; 216+ messages in thread
From: okaya @ 2018-03-27 12:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	Jason Gunthorpe, David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
>> > The interesting thing is that we do seem to have a whole LOT of these
>> > spurrious wmb before writel all over the tree, I suspect because of
>> > that incorrect recommendation in memory-barriers.txt.
>> >
>> > We should fix that.
>> 
>> Maybe the problem is just that it's so counter-intuitive that we don't
>> need that barrier in Linux, when the hardware does need one on some
>> architectures.
>> 
>> How about we define a barrier type instruction specifically for this
>> purpose, something like wmb_before_mmio() and have all architectures
>> define that to an empty macro?
> 
> This is exactly what wmb() is about and exactly what Linux rejected
> back in the day (and in hindsight I agree with him).
> 
>> That way, having correct code using wmb_before_mmio() will not
>> trigger an incorrect review comment that leads to extra wmb(). ;-)
> 
> Ah, you mean have an empty macro that will always be empty on all
> architectures just to fool people ? :-)
> 
> Not sure that will fly ... I think we just need to be documenting that
> stuff better and not have incorrect examples. Also a sweep to remove
> some useless ones like the one in e1000e would help.

I have been converting wmb+writel to wmb+writel_relaxed. (About 30 
patches)

I will have to just remove the wmb and keep writel, then repost.

Some of these got applied. It will cause some churn for the maintainers.

> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 12:22                                             ` okaya
  0 siblings, 0 replies; 216+ messages in thread
From: okaya @ 2018-03-27 12:22 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
>> > The interesting thing is that we do seem to have a whole LOT of these
>> > spurrious wmb before writel all over the tree, I suspect because of
>> > that incorrect recommendation in memory-barriers.txt.
>> >
>> > We should fix that.
>> 
>> Maybe the problem is just that it's so counter-intuitive that we don't
>> need that barrier in Linux, when the hardware does need one on some
>> architectures.
>> 
>> How about we define a barrier type instruction specifically for this
>> purpose, something like wmb_before_mmio() and have all architectures
>> define that to an empty macro?
> 
> This is exactly what wmb() is about and exactly what Linux rejected
> back in the day (and in hindsight I agree with him).
> 
>> That way, having correct code using wmb_before_mmio() will not
>> trigger an incorrect review comment that leads to extra wmb(). ;-)
> 
> Ah, you mean have an empty macro that will always be empty on all
> architectures just to fool people ? :-)
> 
> Not sure that will fly ... I think we just need to be documenting that
> stuff better and not have incorrect examples. Also a sweep to remove
> some useless ones like the one in e1000e would help.

I have been converting wmb+writel to wmb+writel_relaxed. (About 30 
patches)

I will have to just remove the wmb and keep writel, then repost.

Some of these got applied. It will cause some churn for the maintainers.

> 
> Cheers,
> Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-27 11:02                                                 ` Will Deacon
@ 2018-03-27 13:20                                                   ` David Laight
  -1 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-27 13:20 UTC (permalink / raw)
  To: 'Will Deacon', Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe,
	Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring the driver writes use explicit barriers should make
them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe the are required in some places, but the extra synchronising
instructions could easily have measureable performance effects on
hot paths.

Drivers are likely to contain sequences like:
	read_io
	if (...) return
	write_mem
	...
	write_mem
	barrier
	write_mem
	barrier
	write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-27 13:20                                                   ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-27 13:20 UTC (permalink / raw)
  To: 'Will Deacon', Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

> Fair enough. I'd rather people used _relaxed by default, but I have to ad=
mit
> that it will probably just result in them getting things wrong...

Certainly requiring the driver writes use explicit barriers should make
them understand when and why they are needed - and then put in the correct =
ones.

The problem I've had is that I have a good idea which barriers are needed
but find that readl/writel seem to contain a lot of extra ones.
Maybe the are required in some places, but the extra synchronising
instructions could easily have measureable performance effects on
hot paths.

Drivers are likely to contain sequences like:
	read_io
	if (...) return
	write_mem
	...
	write_mem
	barrier
	write_mem
	barrier
	write_io
for things like ring updates.
Where the 'mem' might actually be in io space.
In such sequences not all the synchronising instructions are needed.
I'm not at all sure it is easy to get the right set.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 11:02                                                 ` Will Deacon
@ 2018-03-27 13:46                                                   ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-27 13:46 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra,
	David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On 3/27/2018 7:02 AM, Will Deacon wrote:
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.

Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
in front of these.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 13:46                                                   ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-27 13:46 UTC (permalink / raw)
  To: Will Deacon, Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On 3/27/2018 7:02 AM, Will Deacon wrote:
> -     See Documentation/DMA-API.txt for more information on consistent memory.
> +     can see it now has ownership.  Note that, when using writel(), a prior
> +     wmb() is not needed to guarantee that the cache coherent memory writes
> +     have completed before writing to the MMIO region.  The cheaper
> +     writel_relaxed() does not provide this guarantee and must not be used
> +     here.

Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
in front of these.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 12:22                                             ` okaya
@ 2018-03-27 14:12                                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:12 UTC (permalink / raw)
  To: okaya
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 08:22:55AM -0400, okaya@codeaurora.org wrote:
> On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> >>> The interesting thing is that we do seem to have a whole LOT of these
> >>> spurrious wmb before writel all over the tree, I suspect because of
> >>> that incorrect recommendation in memory-barriers.txt.
> >>>
> >>> We should fix that.
> >>
> >>Maybe the problem is just that it's so counter-intuitive that we don't
> >>need that barrier in Linux, when the hardware does need one on some
> >>architectures.
> >>
> >>How about we define a barrier type instruction specifically for this
> >>purpose, something like wmb_before_mmio() and have all architectures
> >>define that to an empty macro?
> >
> >This is exactly what wmb() is about and exactly what Linux rejected
> >back in the day (and in hindsight I agree with him).
> >
> >>That way, having correct code using wmb_before_mmio() will not
> >>trigger an incorrect review comment that leads to extra wmb(). ;-)
> >
> >Ah, you mean have an empty macro that will always be empty on all
> >architectures just to fool people ? :-)
> >
> >Not sure that will fly ... I think we just need to be documenting that
> >stuff better and not have incorrect examples. Also a sweep to remove
> >some useless ones like the one in e1000e would help.
> 
> I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
>
> I will have to just remove the wmb and keep writel, then repost.

Okay, but before you do that, can we get a statement how this works
for WC?

Some of these writels are to WC memory, do they need the wmb()?!?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 14:12                                               ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:12 UTC (permalink / raw)
  To: okaya
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 08:22:55AM -0400, okaya@codeaurora.org wrote:
> On 2018-03-27 07:23, Benjamin Herrenschmidt wrote:
> >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote:
> >>> The interesting thing is that we do seem to have a whole LOT of these
> >>> spurrious wmb before writel all over the tree, I suspect because of
> >>> that incorrect recommendation in memory-barriers.txt.
> >>>
> >>> We should fix that.
> >>
> >>Maybe the problem is just that it's so counter-intuitive that we don't
> >>need that barrier in Linux, when the hardware does need one on some
> >>architectures.
> >>
> >>How about we define a barrier type instruction specifically for this
> >>purpose, something like wmb_before_mmio() and have all architectures
> >>define that to an empty macro?
> >
> >This is exactly what wmb() is about and exactly what Linux rejected
> >back in the day (and in hindsight I agree with him).
> >
> >>That way, having correct code using wmb_before_mmio() will not
> >>trigger an incorrect review comment that leads to extra wmb(). ;-)
> >
> >Ah, you mean have an empty macro that will always be empty on all
> >architectures just to fool people ? :-)
> >
> >Not sure that will fly ... I think we just need to be documenting that
> >stuff better and not have incorrect examples. Also a sweep to remove
> >some useless ones like the one in e1000e would help.
> 
> I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
>
> I will have to just remove the wmb and keep writel, then repost.

Okay, but before you do that, can we get a statement how this works
for WC?

Some of these writels are to WC memory, do they need the wmb()?!?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  7:56                                     ` Arnd Bergmann
@ 2018-03-27 14:16                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there.

Ah! So it is, too many things called 'db' in that driver.. One of the
'db' is also WC BAR memory..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 14:16                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:16 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Benjamin Herrenschmidt, Sinan Kaya, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> >
> > I even see patches adding wmb() based on actual observed memory
> > corruption during testing on Intel:
> >
> > https://patchwork.kernel.org/patch/10177207/
> >
> > So you think all of this is unnecessary and writel is totally strongly
> > ordered, even on multi-socket Intel?
> 
> This example adds a wmb() between two writes to a coherent DMA
> area, it is definitely required there.

Ah! So it is, too many things called 'db' in that driver.. One of the
'db' is also WC BAR memory..

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27  9:42                                       ` Will Deacon
@ 2018-03-27 14:24                                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Sinan Kaya,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote:
> On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > >
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > >
> > > https://patchwork.kernel.org/patch/10177207/
> > >
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

dma_wmb() is a NOP on x86, it was tested anyhow and didn't help
this case.. Confusing, but probably not relevant to this discussion.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 14:24                                         ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-27 14:24 UTC (permalink / raw)
  To: Will Deacon
  Cc: Arnd Bergmann, Benjamin Herrenschmidt, Sinan Kaya, David Laight,
	Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney

On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote:
> On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote:
> > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote:
> > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote:
> > >
> > > I even see patches adding wmb() based on actual observed memory
> > > corruption during testing on Intel:
> > >
> > > https://patchwork.kernel.org/patch/10177207/
> > >
> > > So you think all of this is unnecessary and writel is totally strongly
> > > ordered, even on multi-socket Intel?
> > 
> > This example adds a wmb() between two writes to a coherent DMA
> > area, it is definitely required there. I'm pretty sure I've never seen
> > any bug reports pointing to a missing wmb() between memory
> > and MMIO write accesses, but if you remember seeing them in the
> > list, maybe you can look again for some evidence of something going
> > wrong on x86 without it?
> 
> If this is just about ordering accesses to coherent DMA, then using
> dma_wmb() instead will be much better performance on arm/arm64.

dma_wmb() is a NOP on x86, it was tested anyhow and didn't help
this case.. Confusing, but probably not relevant to this discussion.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 13:46                                                   ` Sinan Kaya
@ 2018-03-27 14:36                                                     ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 14:36 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

I don't think so. My reading of memory-barriers.txt says that writeX might
expand to outX, and outX is not ordered with respect to other types of
memory.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 14:36                                                     ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 14:36 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Arnd Bergmann, Benjamin Herrenschmidt, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

I don't think so. My reading of memory-barriers.txt says that writeX might
expand to outX, and outX is not ordered with respect to other types of
memory.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-26 22:00                               ` Benjamin Herrenschmidt
  (?)
@ 2018-03-27 14:46                               ` Sinan Kaya
  2018-03-27 15:01                                 ` Jose Abreu
                                                   ` (2 more replies)
  -1 siblings, 3 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-27 14:46 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney,
	netdev, Alexander Duyck

+netdev, +Alex

On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>>  Most of the drivers have a unwound loop with writeq() or something to
>>> do it.
>>
>> But isn't the writeq() barrier much more expensive than anything you'd
>> do in function calls?
> 
> It is for us, and will break any write combining.
> 
>>>>> The same document says that _relaxed() does not give that guarentee.
>>>>>
>>>>> The lwn articule on this went into some depth on the interaction with
>>>>> spinlocks.
>>>>>
>>>>> As far as I can see, containment in a spinlock seems to be the only
>>>>> different between writel and writel_relaxed..
>>>>
>>>> I was always puzzled by this: The intention of _relaxed() on ARM
>>>> (where it originates) was to skip the barrier that serializes DMA
>>>> with MMIO, not to skip the serialization between MMIO and locks.
>>>
>>> But that was never a requirement of writel(),
>>> Documentation/memory-barriers.txt gives an explicit example demanding
>>> the wmb() before writel() for ordering system memory against writel.
> 
> This is a bug in the documentation.
> 
>> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> Adding Alexander Duyck to Cc, he added that section as part of
>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> dma_wmb()"). Also adding the other people that were involved with that.
> 
> Linus himself made it very clear years ago. readl and writel have to
> order vs memory accesses.
> 
>>> I actually have no idea why ARM had that barrier, I always assumed it
>>> was to give program ordering to the accesses and that _relaxed allowed
>>> re-ordering (the usual meaning of relaxed)..
>>>
>>> But the barrier document makes it pretty clear that the only
>>> difference between the two is spinlock containment, and WillD wrote
>>> this text, so I belive it is accurate for ARM.
>>>
>>> Very confusing.
>>
>> It does mention serialization with both DMA and locks in the
>> section about  readX_relaxed()/writeX_relaxed(). The part
>> about DMA is very clear here, and I must have just forgotten
>> the exact semantics with regards to spinlocks. I'm still not
>> sure what prevents a writel() from leaking out the end of a
>> spinlock section that doesn't happen with writel_relaxed(), since
>> the barrier in writel() comes before the access, and the
>> spin_unlock() shouldn't affect the external buses.
> 
> So...
> 
> Historically, what happened is that we (we means whoever participated
> in the discussion on the list with Linus calling the shots really)
> decided that there was no sane way for drivers to understand a world
> where readl/writel didn't fully order things vs. memory accesses (ie,
> DMA).
> 
> So it should always be correct to do:
> 
> 	- Write to some in-memory buffer
> 	- writel() to kick the DMA read of that buffer
> 
> without any extra barrier.
> 
> The spinlock situation however got murky. Mostly that came up because
> on architecture (I forgot who, might have been ia64) has a hard time
> providing that consistency without making writel insanely expensive.
> 
> Thus they created mmiowb whose main purpose was precisely to order
> writel with a following spin_unlock.
> 
> I decided not to go down that path on power because getting all drivers
> "fixed" to do the right thing was going to be a losing battle, and
> instead added per-cpu tracking of writel in order to "escalate" to a
> heavier barrier in spin_unlock itself when necessary.
> 
> Now, all this happened more than a decade ago and it's possible that
> the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan
"We are being told that if you use writel(), then you don't need a wmb() on
all architectures."

Alex:
"I'm not sure who told you that but that is incorrect, at least for
x86. If you attempt to use writel() without the wmb() we will have to
NAK the patches. We will accept the wmb() with writel_releaxed() since
that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex:
"Don't bother. I can tell you right now that for x86 you have to have a
wmb() before the writel().

Based on the comment in
(https://www.spinics.net/lists/linux-rdma/msg62666.html):
    Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
    PPC, it will just not give you a benefit today.

I say the patch set stays. This gives benefit on ARM, and has no
effect on x86 and PowerPC. If you want to look at trying to optimize
things further on PowerPC and such then go for it in terms of trying
to implement the writel_relaxed(). Otherwise I say we call the ARM
goodness a win and don't get ourselves too wrapped up in trying to fix
this for all architectures."



> 
> Cheers,
> Ben.
>  
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 14:46                               ` Sinan Kaya
@ 2018-03-27 15:01                                 ` Jose Abreu
  2018-03-27 15:10                                 ` Will Deacon
  2018-03-27 21:35                                 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 216+ messages in thread
From: Jose Abreu @ 2018-03-27 15:01 UTC (permalink / raw)
  To: Sinan Kaya, Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney,
	netdev, Alexander Duyck

Hi,

On 27-03-2018 15:46, Sinan Kaya wrote:
>
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
>
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
>

So this means we should always use writel() + wmb() in *all*
accesses? I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs).

I see the point in net drivers while you have dma + io accesses
but for most drivers this shouldn't be needed, right?

What about ordering of writes? Is it guaranteed that one write
will happen before the next one ?

Best Regards,
Jose Miguel Abreu

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 14:46                               ` Sinan Kaya
  2018-03-27 15:01                                 ` Jose Abreu
@ 2018-03-27 15:10                                 ` Will Deacon
  2018-03-27 18:54                                   ` Alexander Duyck
  2018-03-28  1:21                                   ` Benjamin Herrenschmidt
  2018-03-27 21:35                                 ` Benjamin Herrenschmidt
  2 siblings, 2 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-27 15:10 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Alexander Duyck, torvalds

Hi Alex,

On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
> +netdev, +Alex
> 
> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
> >>  Most of the drivers have a unwound loop with writeq() or something to
> >>> do it.
> >>
> >> But isn't the writeq() barrier much more expensive than anything you'd
> >> do in function calls?
> > 
> > It is for us, and will break any write combining.
> > 
> >>>>> The same document says that _relaxed() does not give that guarentee.
> >>>>>
> >>>>> The lwn articule on this went into some depth on the interaction with
> >>>>> spinlocks.
> >>>>>
> >>>>> As far as I can see, containment in a spinlock seems to be the only
> >>>>> different between writel and writel_relaxed..
> >>>>
> >>>> I was always puzzled by this: The intention of _relaxed() on ARM
> >>>> (where it originates) was to skip the barrier that serializes DMA
> >>>> with MMIO, not to skip the serialization between MMIO and locks.
> >>>
> >>> But that was never a requirement of writel(),
> >>> Documentation/memory-barriers.txt gives an explicit example demanding
> >>> the wmb() before writel() for ordering system memory against writel.
> > 
> > This is a bug in the documentation.
> > 
> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
> >> Adding Alexander Duyck to Cc, he added that section as part of
> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
> >> dma_wmb()"). Also adding the other people that were involved with that.
> > 
> > Linus himself made it very clear years ago. readl and writel have to
> > order vs memory accesses.
> > 
> >>> I actually have no idea why ARM had that barrier, I always assumed it
> >>> was to give program ordering to the accesses and that _relaxed allowed
> >>> re-ordering (the usual meaning of relaxed)..
> >>>
> >>> But the barrier document makes it pretty clear that the only
> >>> difference between the two is spinlock containment, and WillD wrote
> >>> this text, so I belive it is accurate for ARM.
> >>>
> >>> Very confusing.
> >>
> >> It does mention serialization with both DMA and locks in the
> >> section about  readX_relaxed()/writeX_relaxed(). The part
> >> about DMA is very clear here, and I must have just forgotten
> >> the exact semantics with regards to spinlocks. I'm still not
> >> sure what prevents a writel() from leaking out the end of a
> >> spinlock section that doesn't happen with writel_relaxed(), since
> >> the barrier in writel() comes before the access, and the
> >> spin_unlock() shouldn't affect the external buses.
> > 
> > So...
> > 
> > Historically, what happened is that we (we means whoever participated
> > in the discussion on the list with Linus calling the shots really)
> > decided that there was no sane way for drivers to understand a world
> > where readl/writel didn't fully order things vs. memory accesses (ie,
> > DMA).
> > 
> > So it should always be correct to do:
> > 
> > 	- Write to some in-memory buffer
> > 	- writel() to kick the DMA read of that buffer
> > 
> > without any extra barrier.
> > 
> > The spinlock situation however got murky. Mostly that came up because
> > on architecture (I forgot who, might have been ia64) has a hard time
> > providing that consistency without making writel insanely expensive.
> > 
> > Thus they created mmiowb whose main purpose was precisely to order
> > writel with a following spin_unlock.
> > 
> > I decided not to go down that path on power because getting all drivers
> > "fixed" to do the right thing was going to be a losing battle, and
> > instead added per-cpu tracking of writel in order to "escalate" to a
> > heavier barrier in spin_unlock itself when necessary.
> > 
> > Now, all this happened more than a decade ago and it's possible that
> > the understanding or expectations "shifted" over time...
> 
> Alex is raising concerns on the netdev list.
> 
> Sinan
> "We are being told that if you use writel(), then you don't need a wmb() on
> all architectures."
> 
> Alex:
> "I'm not sure who told you that but that is incorrect, at least for
> x86. If you attempt to use writel() without the wmb() we will have to
> NAK the patches. We will accept the wmb() with writel_releaxed() since
> that solves things for ARM."
> 
> > Jason is seeking behavior clarification for write combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

To clarify: are you saying that on x86 you need a wmb() prior to a writel
if you want that writel to be ordered after prior writes to memory? Is this
specific to WC memory or some other non-standard attribute?

The only reason we have wmb() inside writel() on arm, arm64 and power is for
parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
memory by default so that it was easier to write portable drivers. The
performance impact of that implicit barrier is non-trivial, but we want the
driver portability and I went as far as adding generic _relaxed versions for
the cases where ordering isn't required. You seem to be suggesting that none
of this is necessary and drivers would already run into problems on x86 if
they didn't use wmb() explicitly in conjunction with writel, which I find
hard to believe and is in direct contradiction with the current Linux I/O
memory model (modulo the broken example in the dma_*mb section of
memory-barriers.txt).

Has something changed?

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 15:10                                 ` Will Deacon
@ 2018-03-27 18:54                                   ` Alexander Duyck
  2018-03-27 19:54                                       ` Arnd Bergmann
  2018-03-27 21:33                                     ` Benjamin Herrenschmidt
  2018-03-28  1:21                                   ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 216+ messages in thread
From: Alexander Duyck @ 2018-03-27 18:54 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sinan Kaya, Benjamin Herrenschmidt, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote:
> Hi Alex,
>
> On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote:
>> +netdev, +Alex
>>
>> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote:
>> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote:
>> >>  Most of the drivers have a unwound loop with writeq() or something to
>> >>> do it.
>> >>
>> >> But isn't the writeq() barrier much more expensive than anything you'd
>> >> do in function calls?
>> >
>> > It is for us, and will break any write combining.
>> >
>> >>>>> The same document says that _relaxed() does not give that guarentee.
>> >>>>>
>> >>>>> The lwn articule on this went into some depth on the interaction with
>> >>>>> spinlocks.
>> >>>>>
>> >>>>> As far as I can see, containment in a spinlock seems to be the only
>> >>>>> different between writel and writel_relaxed..
>> >>>>
>> >>>> I was always puzzled by this: The intention of _relaxed() on ARM
>> >>>> (where it originates) was to skip the barrier that serializes DMA
>> >>>> with MMIO, not to skip the serialization between MMIO and locks.
>> >>>
>> >>> But that was never a requirement of writel(),
>> >>> Documentation/memory-barriers.txt gives an explicit example demanding
>> >>> the wmb() before writel() for ordering system memory against writel.
>> >
>> > This is a bug in the documentation.
>> >
>> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> >> Adding Alexander Duyck to Cc, he added that section as part of
>> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> >> dma_wmb()"). Also adding the other people that were involved with that.
>> >
>> > Linus himself made it very clear years ago. readl and writel have to
>> > order vs memory accesses.
>> >
>> >>> I actually have no idea why ARM had that barrier, I always assumed it
>> >>> was to give program ordering to the accesses and that _relaxed allowed
>> >>> re-ordering (the usual meaning of relaxed)..
>> >>>
>> >>> But the barrier document makes it pretty clear that the only
>> >>> difference between the two is spinlock containment, and WillD wrote
>> >>> this text, so I belive it is accurate for ARM.
>> >>>
>> >>> Very confusing.
>> >>
>> >> It does mention serialization with both DMA and locks in the
>> >> section about  readX_relaxed()/writeX_relaxed(). The part
>> >> about DMA is very clear here, and I must have just forgotten
>> >> the exact semantics with regards to spinlocks. I'm still not
>> >> sure what prevents a writel() from leaking out the end of a
>> >> spinlock section that doesn't happen with writel_relaxed(), since
>> >> the barrier in writel() comes before the access, and the
>> >> spin_unlock() shouldn't affect the external buses.
>> >
>> > So...
>> >
>> > Historically, what happened is that we (we means whoever participated
>> > in the discussion on the list with Linus calling the shots really)
>> > decided that there was no sane way for drivers to understand a world
>> > where readl/writel didn't fully order things vs. memory accesses (ie,
>> > DMA).
>> >
>> > So it should always be correct to do:
>> >
>> >     - Write to some in-memory buffer
>> >     - writel() to kick the DMA read of that buffer
>> >
>> > without any extra barrier.
>> >
>> > The spinlock situation however got murky. Mostly that came up because
>> > on architecture (I forgot who, might have been ia64) has a hard time
>> > providing that consistency without making writel insanely expensive.
>> >
>> > Thus they created mmiowb whose main purpose was precisely to order
>> > writel with a following spin_unlock.
>> >
>> > I decided not to go down that path on power because getting all drivers
>> > "fixed" to do the right thing was going to be a losing battle, and
>> > instead added per-cpu tracking of writel in order to "escalate" to a
>> > heavier barrier in spin_unlock itself when necessary.
>> >
>> > Now, all this happened more than a decade ago and it's possible that
>> > the understanding or expectations "shifted" over time...
>>
>> Alex is raising concerns on the netdev list.
>>
>> Sinan
>> "We are being told that if you use writel(), then you don't need a wmb() on
>> all architectures."
>>
>> Alex:
>> "I'm not sure who told you that but that is incorrect, at least for
>> x86. If you attempt to use writel() without the wmb() we will have to
>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>> that solves things for ARM."
>>
>> > Jason is seeking behavior clarification for write combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?

Note, I am not a CPU guy so this is just my interpretation. It is my
understanding that the wmb(), aka sfence, is needed on x86 to sort out
writes between Write-back(WB) system memory and Strong Uncacheable
(UC) MMIO accesses.

I was hoping to be able to cite something in the software developers
manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
but that tends to be pretty vague. I have re-read section 22.34
(volume 3B) several times and I am still not clear on if it says we
need the sfence or not. It is a matter of figuring out what the impact
of store buffers and caching are for WB versus UC memory.

> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Is the issue specifically related to memory versus I/O or are there
potential ordering issues for MMIO versus MMIO? I recall when working
on the dma_*mb section that the ARM barriers were much more complex
versus some of the other architectures. One big difference that I can
see for the x86 versus what you define for the "_relaxed" version of
things is the ordering of MMIO operations with respect to locked
transactions. I know x86 forces all MMIO operations to be completed
before you can process any locked operation.

> Has something changed?
>
> Will

As far as I know the code has been this way for a while, something
like 2002, when the barrier was already present in e1000. However
there it was calling out weakly ordered models "such as IA-64". Since
then pretty much all the hardware based network drivers at this point
have similar code floating around with wmb() in place to prevent
issues on weak ordered memory systems.

So in any case we still need to be careful as there are architectures
that are depending on this even if they might not be x86. :-/

- Alex

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 18:54                                   ` Alexander Duyck
@ 2018-03-27 19:54                                       ` Arnd Bergmann
  2018-03-27 21:33                                     ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 19:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote:

>>>
>>> Sinan
>>> "We are being told that if you use writel(), then you don't need a wmb() on
>>> all architectures."
>>>
>>> Alex:
>>> "I'm not sure who told you that but that is incorrect, at least for
>>> x86. If you attempt to use writel() without the wmb() we will have to
>>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>>> that solves things for ARM."
>>>
>>> > Jason is seeking behavior clarification for write combined buffers.
>>>
>>> Alex:
>>> "Don't bother. I can tell you right now that for x86 you have to have a
>>> wmb() before the writel().
>>
>> To clarify: are you saying that on x86 you need a wmb() prior to a writel
>> if you want that writel to be ordered after prior writes to memory? Is this
>> specific to WC memory or some other non-standard attribute?
>
> Note, I am not a CPU guy so this is just my interpretation. It is my
> understanding that the wmb(), aka sfence, is needed on x86 to sort out
> writes between Write-back(WB) system memory and Strong Uncacheable
> (UC) MMIO accesses.
>
> I was hoping to be able to cite something in the software developers
> manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
> but that tends to be pretty vague. I have re-read section 22.34
> (volume 3B) several times and I am still not clear on if it says we
> need the sfence or not. It is a matter of figuring out what the impact
> of store buffers and caching are for WB versus UC memory.

Here is what I found regarding the store buffer in that document:

11.10 STORE BUFFER
Intel 64 and IA-32 processors temporarily store each write (store) to
memory in a store buffer. The store buffer
improves processor performance by allowing the processor to continue
executing instructions without having to
wait until a write to memory and/or to a cache is complete. It also
allows writes to be delayed for more efficient use
of memory-access bus cycles.
In general, the existence of the store buffer is transparent to
software, even in systems that use multiple processors.
The processor ensures that write operations are always carried out in
program order. It also insures that the
contents of the store buffer are always drained to memory in the
following situations:
• When an exception or interrupt is generated.
• (P6 and more recent processor families only) When a serializing
instruction is executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (P6 and more recent processor families only) When a BINIT operation
is performed.
• (Pentium III, and more recent processor families only) When using an
SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an
MFENCE instruction to order stores.
The discussion of write ordering in Section 8.2, “Memory Ordering,”
gives a detailed description of the operation of
the store buffer.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 19:54                                       ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 19:54 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote:

>>>
>>> Sinan
>>> "We are being told that if you use writel(), then you don't need a wmb(=
) on
>>> all architectures."
>>>
>>> Alex:
>>> "I'm not sure who told you that but that is incorrect, at least for
>>> x86. If you attempt to use writel() without the wmb() we will have to
>>> NAK the patches. We will accept the wmb() with writel_releaxed() since
>>> that solves things for ARM."
>>>
>>> > Jason is seeking behavior clarification for write combined buffers.
>>>
>>> Alex:
>>> "Don't bother. I can tell you right now that for x86 you have to have a
>>> wmb() before the writel().
>>
>> To clarify: are you saying that on x86 you need a wmb() prior to a write=
l
>> if you want that writel to be ordered after prior writes to memory? Is t=
his
>> specific to WC memory or some other non-standard attribute?
>
> Note, I am not a CPU guy so this is just my interpretation. It is my
> understanding that the wmb(), aka sfence, is needed on x86 to sort out
> writes between Write-back(WB) system memory and Strong Uncacheable
> (UC) MMIO accesses.
>
> I was hoping to be able to cite something in the software developers
> manual (https://software.intel.com/sites/default/files/managed/39/c5/3254=
62-sdm-vol-1-2abcd-3abcd.pdf),
> but that tends to be pretty vague. I have re-read section 22.34
> (volume 3B) several times and I am still not clear on if it says we
> need the sfence or not. It is a matter of figuring out what the impact
> of store buffers and caching are for WB versus UC memory.

Here is what I found regarding the store buffer in that document:

11.10 STORE BUFFER
Intel 64 and IA-32 processors temporarily store each write (store) to
memory in a store buffer. The store buffer
improves processor performance by allowing the processor to continue
executing instructions without having to
wait until a write to memory and/or to a cache is complete. It also
allows writes to be delayed for more efficient use
of memory-access bus cycles.
In general, the existence of the store buffer is transparent to
software, even in systems that use multiple processors.
The processor ensures that write operations are always carried out in
program order. It also insures that the
contents of the store buffer are always drained to memory in the
following situations:
=E2=80=A2 When an exception or interrupt is generated.
=E2=80=A2 (P6 and more recent processor families only) When a serializing
instruction is executed.
=E2=80=A2 When an I/O instruction is executed.
=E2=80=A2 When a LOCK operation is performed.
=E2=80=A2 (P6 and more recent processor families only) When a BINIT operati=
on
is performed.
=E2=80=A2 (Pentium III, and more recent processor families only) When using=
 an
SFENCE instruction to order stores.
=E2=80=A2 (Pentium 4 and more recent processor families only) When using an
MFENCE instruction to order stores.
The discussion of write ordering in Section 8.2, =E2=80=9CMemory Ordering,=
=E2=80=9D
gives a detailed description of the operation of
the store buffer.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 19:54                                       ` Arnd Bergmann
@ 2018-03-27 20:46                                         ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 20:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> 11.10 STORE BUFFER
> Intel 64 and IA-32 processors temporarily store each write (store) to
> memory in a store buffer. The store buffer
> improves processor performance by allowing the processor to continue
> executing instructions without having to
> wait until a write to memory and/or to a cache is complete. It also
> allows writes to be delayed for more efficient use
> of memory-access bus cycles.
> In general, the existence of the store buffer is transparent to
> software, even in systems that use multiple processors.
> The processor ensures that write operations are always carried out in
> program order. It also insures that the
> contents of the store buffer are always drained to memory in the
> following situations:
> • When an exception or interrupt is generated.
> • (P6 and more recent processor families only) When a serializing
> instruction is executed.
> • When an I/O instruction is executed.

I guess I/O instruction is still ambiguous on x86, it may just refer
to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO
area.

Here's a link to a reply from Linus that I found on this topic:

http://yarchive.net/comp/linux/write_combining.html

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 20:46                                         ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-27 20:46 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck
> <alexander.duyck@gmail.com> wrote:
>> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote=
:
>
> 11.10 STORE BUFFER
> Intel 64 and IA-32 processors temporarily store each write (store) to
> memory in a store buffer. The store buffer
> improves processor performance by allowing the processor to continue
> executing instructions without having to
> wait until a write to memory and/or to a cache is complete. It also
> allows writes to be delayed for more efficient use
> of memory-access bus cycles.
> In general, the existence of the store buffer is transparent to
> software, even in systems that use multiple processors.
> The processor ensures that write operations are always carried out in
> program order. It also insures that the
> contents of the store buffer are always drained to memory in the
> following situations:
> =E2=80=A2 When an exception or interrupt is generated.
> =E2=80=A2 (P6 and more recent processor families only) When a serializing
> instruction is executed.
> =E2=80=A2 When an I/O instruction is executed.

I guess I/O instruction is still ambiguous on x86, it may just refer
to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO
area.

Here's a link to a reply from Linus that I found on this topic:

http://yarchive.net/comp/linux/write_combining.html

      Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 13:46                                                   ` Sinan Kaya
@ 2018-03-27 21:24                                                     ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:24 UTC (permalink / raw)
  To: Sinan Kaya, Will Deacon, Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra,
	David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

Yes, they should have the same semantics as writel

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 21:24                                                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:24 UTC (permalink / raw)
  To: Sinan Kaya, Will Deacon, Arnd Bergmann
  Cc: Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote:
> On 3/27/2018 7:02 AM, Will Deacon wrote:
> > -     See Documentation/DMA-API.txt for more information on consistent memory.
> > +     can see it now has ownership.  Note that, when using writel(), a prior
> > +     wmb() is not needed to guarantee that the cache coherent memory writes
> > +     have completed before writing to the MMIO region.  The cheaper
> > +     writel_relaxed() does not provide this guarantee and must not be used
> > +     here.
> 
> Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> in front of these.

Yes, they should have the same semantics as writel

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 14:12                                               ` Jason Gunthorpe
@ 2018-03-27 21:27                                                 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe, okaya
  Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon,
	David Laight, Oliver, Alexander Duyck,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Tue, 2018-03-27 at 08:12 -0600, Jason Gunthorpe wrote:
> > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
> > 
> > I will have to just remove the wmb and keep writel, then repost.
> 
> Okay, but before you do that, can we get a statement how this works
> for WC?
> 
> Some of these writels are to WC memory, do they need the wmb()?!?

This is an issue as we don't have well defined semantics for WC.

At this point, I would suggest staying away from that (ie, not changing
them). We need to look into it.

I know for example that on powerpc I cannot give you any weaker
semantic on WC for writel (I have to put a full sync in there), but I
am trying to see if I can make writel_relaxed both work with the
existing semantics and provide combining. But it's not yet a given (our
weaker IO barrier, eieio, isn't architecturally defined to do anything
on G=0 space, looking with the HW guys at what the HW actually does).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 21:27                                                 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:27 UTC (permalink / raw)
  To: Jason Gunthorpe, okaya
  Cc: Arnd Bergmann, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney

On Tue, 2018-03-27 at 08:12 -0600, Jason Gunthorpe wrote:
> > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches)
> > 
> > I will have to just remove the wmb and keep writel, then repost.
> 
> Okay, but before you do that, can we get a statement how this works
> for WC?
> 
> Some of these writels are to WC memory, do they need the wmb()?!?

This is an issue as we don't have well defined semantics for WC.

At this point, I would suggest staying away from that (ie, not changing
them). We need to look into it.

I know for example that on powerpc I cannot give you any weaker
semantic on WC for writel (I have to put a full sync in there), but I
am trying to see if I can make writel_relaxed both work with the
existing semantics and provide combining. But it's not yet a given (our
weaker IO barrier, eieio, isn't architecturally defined to do anything
on G=0 space, looking with the HW guys at what the HW actually does).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 14:36                                                     ` Will Deacon
@ 2018-03-27 21:29                                                       ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:29 UTC (permalink / raw)
  To: Will Deacon, Sinan Kaya
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Jason Gunthorpe,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > in front of these.
> 
> I don't think so. My reading of memory-barriers.txt says that writeX might
> expand to outX, and outX is not ordered with respect to other types of
> memory.

Ugh ?

My understanding of HW at least is the exact opposite. outX is *more*
ordered if anything, than any other accessors. IO space is completely
synchronous, non posted and ordered afaik.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 21:29                                                       ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:29 UTC (permalink / raw)
  To: Will Deacon, Sinan Kaya
  Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > in front of these.
> 
> I don't think so. My reading of memory-barriers.txt says that writeX might
> expand to outX, and outX is not ordered with respect to other types of
> memory.

Ugh ?

My understanding of HW at least is the exact opposite. outX is *more*
ordered if anything, than any other accessors. IO space is completely
synchronous, non posted and ordered afaik.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 18:54                                   ` Alexander Duyck
  2018-03-27 19:54                                       ` Arnd Bergmann
@ 2018-03-27 21:33                                     ` Benjamin Herrenschmidt
  2018-03-28  0:39                                       ` Linus Torvalds
  1 sibling, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:33 UTC (permalink / raw)
  To: Alexander Duyck, Will Deacon
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Linus Torvalds

On Tue, 2018-03-27 at 11:54 -0700, Alexander Duyck wrote:
> As far as I know the code has been this way for a while, something
> like 2002, when the barrier was already present in e1000. However
> there it was calling out weakly ordered models "such as IA-64". Since
> then pretty much all the hardware based network drivers at this point
> have similar code floating around with wmb() in place to prevent
> issues on weak ordered memory systems.
> 
> So in any case we still need to be careful as there are architectures
> that are depending on this even if they might not be x86. :-/

Well, we need to clarify that once and for all, because as I wrote
earlier, it was decreed by Linus more than a decade ago that writel
would be fully ordered by itself vs. previous memory stores (at least
on UC memory).

This is why we added sync's to writel on powerpc and later ARM added
similar barriers to theirs.

This is also why writel_relaxed was added (though much later), since
what writel_relaxed does is to life that specific requirement.

IE. If what you say is true and wmb() is needed on x86, then
writel_relaxed is now completely useless...

Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 14:46                               ` Sinan Kaya
  2018-03-27 15:01                                 ` Jose Abreu
  2018-03-27 15:10                                 ` Will Deacon
@ 2018-03-27 21:35                                 ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 21:35 UTC (permalink / raw)
  To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe
  Cc: David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney,
	netdev, Alexander Duyck

On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>  combined buffers.
> 
> Alex:
> "Don't bother. I can tell you right now that for x86 you have to have a
> wmb() before the writel().

No, this isn't the semantics of writel. You shouldn't need it unless
something changed and we need to revisit our complete understanding of
*all* MMIO accessor semantics.

At least for UC space, it has always been accepted (and enforced) that
writel would not require any other barrier to order vs. previous stores
to memory.

> Based on the comment in
> (https://www.spinics.net/lists/linux-rdma/msg62666.html):
>     Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
>     PPC, it will just not give you a benefit today.
> 
> I say the patch set stays. This gives benefit on ARM, and has no
> effect on x86 and PowerPC. If you want to look at trying to optimize
> things further on PowerPC and such then go for it in terms of trying
> to implement the writel_relaxed(). Otherwise I say we call the ARM
> goodness a win and don't get ourselves too wrapped up in trying to fix
> this for all architectures."

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 21:33                                     ` Benjamin Herrenschmidt
@ 2018-03-28  0:39                                       ` Linus Torvalds
  2018-03-28  1:03                                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  0:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not  that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

   spin_lock(&somelock);
   writel(a);
   writel(b);
   spin_unlock(&somelock);

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.

                Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  0:39                                       ` Linus Torvalds
@ 2018-03-28  1:03                                         ` Benjamin Herrenschmidt
  2018-03-28  2:51                                           ` Linus Torvalds
  0 siblings, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  1:03 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > 
> > Well, we need to clarify that once and for all, because as I wrote
> > earlier, it was decreed by Linus more than a decade ago that writel
> > would be fully ordered by itself vs. previous memory stores (at least
> > on UC memory).
> 
> Yes.
> 
> So "writel()" needs to be ordered with respect to other writel() uses
> on the same thread. Anything else *will* break drivers. Obviously, the
> drivers may then do magic to say "do write combining etc", but that
> magic will be architecture-specific.
> 
> The other issue is that "writel()" needs to be ordered wrt other CPU's
> doing "writel()" if those writel's are in a spinlocked region.

 .../...

The discussion at hand is about

	dma_buffer->foo = 1;			/* WB */
	writel(KICK, DMA_KICK_REGISTER);	/* UC */

(The WC case is something else, let's not mix things up just yet)

IE, a store to normal WB cache memory followed by a writel to a device
which will then DMA from that buffer.

Back in the days, we did require on powerpc a wmb() between these, but
you made the point that x86 didn't and driver writers would never get
that right.

We decided to go conservative, added the necessary barrier inside
writel, so did ARM and it became the norm that writel is also fully
ordered vs. previous stores to memory *by the same CPU* of course (or
protected by the same spinlock).

Now it appears that this wasn't fully understood back then, and some
people are now saying that x86 might not even provide that semantic
always.

So a number (fairly large) of drivers have been adding wmb() in those
case, while others haven't, and it's a mess.

The mess is compounded by the fact that if writel is now defined to
*not* provide that ordering guarantee, then writel_relaxed() is
pointless since all it is defined to relax is precisely the above
ordering guarantee.

So I want to get to the bottom of this once and for all so we can have
well defined and documented semantics and stop having drivers do random
things that may or may not work on some or all architectures (including
x86 !).

Quite note about the spinlock case... In fact this is the only case you
did allow back then to be relaxed. In theory a writel followed by a
spin_unlock requires an mmiowb (which is the only point of that barrier
in fact). This was done because an arch (I think ia64) had a hard time
getting MMIOs from multiple CPUs get in order vs. a lock and required
an expensive access to the PCI host bridge to do so.

Back then, on powerpc, we chose not to allow that relaxing and instead
added code to our writel to set a per-cpu flag which would cause the
next spin_unlock to use a stronger barrier than usual.

We do need to clarify this as well, but let's start with the most basic
one first, there is enough confusion already.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 15:10                                 ` Will Deacon
  2018-03-27 18:54                                   ` Alexander Duyck
@ 2018-03-28  1:21                                   ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  1:21 UTC (permalink / raw)
  To: Will Deacon, Sinan Kaya
  Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev,
	Alexander Duyck, torvalds

On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote:
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?
> 
> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Another clarification while we are at it ....

All of this only applies to concurrent access by the CPU and the device
to memory allocate with dma_alloc_coherent().

For memory "mapped" into the DMA domain via dma_map_* then an extra
dma_sync_for_* is needed.

In most useful server cases etc... these latter are NOPs, but
architecture without full DMA cache coherency or using swiotlb,
dma_map_* might maintain bounce buffers or play additional cache
flushing tricks.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  1:03                                         ` Benjamin Herrenschmidt
@ 2018-03-28  2:51                                           ` Linus Torvalds
  2018-03-28  3:24                                             ` Sinan Kaya
  2018-03-28  4:33                                             ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  2:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> The discussion at hand is about
>
>         dma_buffer->foo = 1;                    /* WB */
>         writel(KICK, DMA_KICK_REGISTER);        /* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple  stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absoutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence
or mfence between them.

 - do a store, and a subsequent from a different address, and neither
of them is UC: put a mfence between them. But note that this is
literally just "load after store". A "store after load" doesn't need
one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous  load has finished, and that in turn meant
that it is  now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We so support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

       Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  2:51                                           ` Linus Torvalds
@ 2018-03-28  3:24                                             ` Sinan Kaya
  2018-03-28  4:41                                               ` Benjamin Herrenschmidt
  2018-03-28  6:14                                               ` Linus Torvalds
  2018-03-28  4:33                                             ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-28  3:24 UTC (permalink / raw)
  To: Linus Torvalds, Benjamin Herrenschmidt
  Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>>         dma_buffer->foo = 1;                    /* WB */
>>         writel(KICK, DMA_KICK_REGISTER);        /* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?
We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

         dma_buffer->foo = 1;                    /* WB */
	 wmb()
         writel(KICK, DMA_KICK_REGISTER);        /* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

dma_buffer->foo = 1;                    /* WB */
wmb()
writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
mmiowb()

This is a small step in the performance direction until we remove all exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  2:51                                           ` Linus Torvalds
  2018-03-28  3:24                                             ` Sinan Kaya
@ 2018-03-28  4:33                                             ` Benjamin Herrenschmidt
  2018-03-28  6:26                                               ` Linus Torvalds
  1 sibling, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  4:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > 
> > The discussion at hand is about
> > 
> >         dma_buffer->foo = 1;                    /* WB */
> >         writel(KICK, DMA_KICK_REGISTER);        /* UC */
> 
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Ok so this is our answer ...

 ... snip ... (thanks for the background info !)

> Oh, the above UC case is absoutely guaranteed.

Good.

Then....

> The only issue really is that 99.9% of all testing gets done on x86
> unless you look at specific SoC drivers.
> 
> On ARM, for example, there is likely little reason to care about x86
> memory ordering, because there is almost zero driver overlap between
> x86 and ARM.
> 
> *Historically*, the reason for following the x86 IO ordering was
> simply that a lot of architectures used the drivers that were
> developed on x86. The alpha and powerpc workstations were *designed*
> with the x86 IO bus (PCI, then PCIe) and to work with the devices that
> came with it.
> 
> ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
> off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
> test coverage" is simply not an issue.
> 
> How much of an issue is it for Power? Maybe you decide it's not a big deal.
> 
> Then all the above is almost irrelevant.

So the overlap may not be that NIL in practice :-) But even then that
doesn't matter as ARM has been happily implementing the same semantic
you describe above for years, as do we powerpc.

This is why, I want (with your agreement) to define clearly and once
and for all, that the Linux semantics of writel are that it is ordered
with previous writes to coherent memory (*)

This is already what ARM and powerpc provide, from what you say, what
x86 provides, I don't see any reason to keep that badly documented and
have drivers randomly growing useless wmb()'s because they don't think
it works on x86 without them !

Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock
and the problem of writel_relaxed semantics but as separate issues :-)

Also, can I assume the above ordering with writel() equally applies to
readl() or not ?

IE:
	dma_buf->foo = 1;
	readl(STUPID_DEVICE_DMA_KICK_ON_READ);

Also works on x86 ? (It does on power, maybe not on ARM).

Cheers,
Ben.

(*) From an Linux API perspective, all of this is only valid if the
memory was allocated by dma_alloc_coherent(). Anything obtained by
dma_map_something() might have been bounced bufferred or might require
extra cache flushes on some architectures, and thus needs
dma_sync_for_{cpu,device} calls.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  3:24                                             ` Sinan Kaya
@ 2018-03-28  4:41                                               ` Benjamin Herrenschmidt
  2018-03-28  6:14                                               ` Linus Torvalds
  1 sibling, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  4:41 UTC (permalink / raw)
  To: Sinan Kaya, Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote:
> On 3/27/2018 10:51 PM, Linus Torvalds wrote:
> > > The discussion at hand is about
> > > 
> > >         dma_buffer->foo = 1;                    /* WB */
> > >         writel(KICK, DMA_KICK_REGISTER);        /* UC */
> > 
> > Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> > even if that writel() might be of type WC, because that only delays
> > writes, it doesn't move them earlier.
> 
> Now that we clarified x86 myth, Is this guaranteed on all architectures?

If not we need to fix it. It's guaranteed on the "main" ones (arm,
arm64, powerpc, i386, x86_64). We might need to check with other arch
maintainers for the rest.

We really want Linux to provide well defined "sane" semantics for the
basic writel accessors.

Note: We still have open questions about how readl() relates to
surrounding memory accesses. It looks like ARM and powerpc do different
things here.

> We keep getting IA64 exception example. Maybe, this is also corrected since
> then.

I would think ia64 fixed it back when it was all discussed. I was under
the impression all ia64 had "special" was the writel vs. spin_unlock
which requires mmiowb, but maybe that was never completely fixed ?

> Jose Abreu says "I don't know about x86 but arc architecture doesn't
> have a wmb() in the writel() function (in some configs)".

Well, it probably should then.

> As long as we have these exceptions, these wmb() in common drivers is not
> going anywhere and relaxed-arches will continue paying performance penalty.

Well, let's fix them or leave them broken, at this point, it doesn't
matter. We can give all arch maintainers a wakeup call and start making
drivers work based on the documented assumptions.

> I see 15% performance loss on ARM64 servers using Intel i40e network
> drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
> most of the time rather than real work because of sequences like this all over
> the place.
> 
>          dma_buffer->foo = 1;                    /* WB */
> 	 wmb()
>          writel(KICK, DMA_KICK_REGISTER);        /* UC */
>
> I posted several patches last week to remove duplicate barriers on ARM while
> trying to make the code friendly with other architectures.
> 
> Basically changing it to
> 
> dma_buffer->foo = 1;                    /* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
> mmiowb()
> 
> This is a small step in the performance direction until we remove all exceptions.
> 
> https://www.spinics.net/lists/netdev/msg491842.html
> https://www.spinics.net/lists/linux-rdma/msg62434.html
> https://www.spinics.net/lists/arm-kernel/msg642336.html
> 
> Discussion started to move around the need for relaxed API on PPC and then
> why wmb() question came up.

I'm working on the problem of relaxed APIs for powerpc, but we can keep
that orthogonal. As is, today, a wmb() + writel() and a wmb() +
writel_relaxed() on powerpc are identical. So changing them will not
break us.

But I don't see the point of doing that transformation if we can just
get the straying archs fixed. It's not like any of them has a
significant market presence these days anyway.

Cheers,
Ben.

> Sinan
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  3:24                                             ` Sinan Kaya
  2018-03-28  4:41                                               ` Benjamin Herrenschmidt
@ 2018-03-28  6:14                                               ` Linus Torvalds
  2018-03-28 11:41                                                 ` okaya
  1 sibling, 1 reply; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  6:14 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: Benjamin Herrenschmidt, Alexander Duyck, Will Deacon,
	Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>
> Basically changing it to
>
> dma_buffer->foo = 1;                    /* WB */
> wmb()
> writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
> mmiowb()

Why?

Why not  just remove the wmb(), and keep the barrier in the writel()?

The above code makes no sense, and just looks stupid to me. It also
generates pointlessly bad code on x86, so it's bad there too.

               Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  4:33                                             ` Benjamin Herrenschmidt
@ 2018-03-28  6:26                                               ` Linus Torvalds
  2018-03-28  6:42                                                 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  6:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> This is why, I want (with your agreement) to define clearly and once
> and for all, that the Linux semantics of writel are that it is ordered
> with previous writes to coherent memory (*)

Honestly, I think those are the sane semantics. In fact, make it
"ordered with previous writes" full stop, since it's not only ordered
wrt previous writes to memory, but also previous writel's.

> Also, can I assume the above ordering with writel() equally applies to
> readl() or not ?
>
> IE:
>         dma_buf->foo = 1;
>         readl(STUPID_DEVICE_DMA_KICK_ON_READ);

If that KICK_ON_READ is UC, then that's definitely the case. And
honestly, status registers like that really should always be UC.

But if somebody sets the area WC (which is crazy), then I think it
might be at least debatable. x86 semantics does allow reads to be done
before previous writes (or, put another way, writes to be buffered -
the buffers are ordered so writes don't get re-ordered, but reads can
happen during the buffering).

But UC accesses are always  done entirely ordered, and honestly, any
status register that starts a DMA would not make sense any other way.

Of course, you'd have to be pretty odd to want to start a DMA with a
read anyway - partly exactly because it's bad for performance since
reads will be synchronous and not buffered like a write).

                   Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:26                                               ` Linus Torvalds
@ 2018-03-28  6:42                                                 ` Benjamin Herrenschmidt
  2018-03-28  6:53                                                     ` Linus Torvalds
  2018-03-28  9:07                                                   ` Will Deacon
  0 siblings, 2 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  6:42 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > 
> > This is why, I want (with your agreement) to define clearly and once
> > and for all, that the Linux semantics of writel are that it is ordered
> > with previous writes to coherent memory (*)
> 
> Honestly, I think those are the sane semantics. In fact, make it
> "ordered with previous writes" full stop, since it's not only ordered
> wrt previous writes to memory, but also previous writel's.

Of course. It was somewhat a given that it's ordered vs. any previous
MMIO actually, but it doesn't hurt to spell it out once more.

> > Also, can I assume the above ordering with writel() equally applies to
> > readl() or not ?
> > 
> > IE:
> >         dma_buf->foo = 1;
> >         readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> 
> If that KICK_ON_READ is UC, then that's definitely the case. And
> honestly, status registers like that really should always be UC.
> 
> But if somebody sets the area WC (which is crazy), then I think it
> might be at least debatable. x86 semantics does allow reads to be done
> before previous writes (or, put another way, writes to be buffered -
> the buffers are ordered so writes don't get re-ordered, but reads can
> happen during the buffering).

Right, for now I worry about UC semantics. Once we have nailed that, we
can look at WC, which is a lot more tricky as archs differs more
widely, but one thing at a time.

> But UC accesses are always  done entirely ordered, and honestly, any
> status register that starts a DMA would not make sense any other way.
> 
> Of course, you'd have to be pretty odd to want to start a DMA with a
> read anyway - partly exactly because it's bad for performance since
> reads will be synchronous and not buffered like a write).

I have bad memories of old adaptec controllers ...

That said, I think the above might not be right on ARM if we want to
make it the rule, Will, what do you reckon ?

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:42                                                 ` Benjamin Herrenschmidt
  2018-03-28  6:53                                                     ` Linus Torvalds
@ 2018-03-28  6:53                                                     ` Linus Torvalds
  1 sibling, 0 replies; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  6:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, linux-rdma, netdev, Will Deacon, Alexander Duyck,
	Sinan Kaya, Jason Gunthorpe, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org>
wrote:

> >
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
>
> I have bad memories of old adaptec controllers ...
>

*Old* adaptec controllers were likely to use the in/out instructions for
status and command data.

Those are actually even more ordered than UC reads and writes: the in/out
instructions are not just fully ordered, but are fully *synchronous* on
x86.

So not just doing accesses in order, but actually waiting for everything to
drain before they start executing, but they also wait for the operation
itself to complete (ie "out" will not just queue the write, it will then
wait for the queue to empty and the write data to hit the line).

That's why in/out were *so* slow, and why nobody uses them any more (well,
the address size limitations and the lack of any remapping of the address
obviously also are a reason).

    Linus

>

[-- Attachment #2: Type: text/html, Size: 1798 bytes --]

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  6:53                                                     ` Linus Torvalds
  0 siblings, 0 replies; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  6:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, linux-rdma, netdev, Will Deacon, Alexander Duyck,
	Sinan Kaya, Jason Gunthorpe, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org>
wrote:

> >
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
>
> I have bad memories of old adaptec controllers ...
>

*Old* adaptec controllers were likely to use the in/out instructions for
status and command data.

Those are actually even more ordered than UC reads and writes: the in/out
instructions are not just fully ordered, but are fully *synchronous* on
x86.

So not just doing accesses in order, but actually waiting for everything to
drain before they start executing, but they also wait for the operation
itself to complete (ie "out" will not just queue the write, it will then
wait for the queue to empty and the write data to hit the line).

That's why in/out were *so* slow, and why nobody uses them any more (well,
the address size limitations and the lack of any remapping of the address
obviously also are a reason).

    Linus

>

[-- Attachment #2: Type: text/html, Size: 1798 bytes --]

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  6:53                                                     ` Linus Torvalds
  0 siblings, 0 replies; 216+ messages in thread
From: Linus Torvalds @ 2018-03-28  6:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

[-- Attachment #1: Type: text/plain, Size: 1086 bytes --]

On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org>
wrote:

> >
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
>
> I have bad memories of old adaptec controllers ...
>

*Old* adaptec controllers were likely to use the in/out instructions for
status and command data.

Those are actually even more ordered than UC reads and writes: the in/out
instructions are not just fully ordered, but are fully *synchronous* on
x86.

So not just doing accesses in order, but actually waiting for everything to
drain before they start executing, but they also wait for the operation
itself to complete (ie "out" will not just queue the write, it will then
wait for the queue to empty and the write data to hit the line).

That's why in/out were *so* slow, and why nobody uses them any more (well,
the address size limitations and the lack of any remapping of the address
obviously also are a reason).

    Linus

>

[-- Attachment #2: Type: text/html, Size: 1798 bytes --]

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:53                                                     ` Linus Torvalds
  (?)
  (?)
@ 2018-03-28  6:56                                                     ` Benjamin Herrenschmidt
  2018-03-28  7:11                                                       ` Arnd Bergmann
  -1 siblings, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  6:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote:
> 
> 
> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crash
> ing.org> wrote:
> > >
> > > Of course, you'd have to be pretty odd to want to start a DMA
> > with a
> > > read anyway - partly exactly because it's bad for performance
> > since
> > > reads will be synchronous and not buffered like a write).
> > 
> > I have bad memories of old adaptec controllers ...
> 
> *Old* adaptec controllers were likely to use the in/out instructions
> for status and command data.
> 
> Those are actually even more ordered than UC reads and writes: the
> in/out instructions are not just fully ordered, but are fully
> *synchronous* on x86. 
> 
> So not just doing accesses in order, but actually waiting for
> everything to drain before they start executing, but they also wait
> for the operation itself to complete (ie "out" will not just queue
> the write, it will then wait for the queue to empty and the write
> data to hit the line).
> 
> That's why in/out were *so* slow, and why nobody uses them any more
> (well, the address size limitations and the lack of any remapping of
> the address obviously also are a reason).

All true indeed, though a lot of other archs never quite made them
fully synchronous, which was another can of worms ... oh well.

As for Adaptec, you might be right, I do remember having cases of old
stuff triggering DMA on reads, it might have been "Mac" variants of
Adaptec using MMIO or something...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:56                                                     ` Benjamin Herrenschmidt
@ 2018-03-28  7:11                                                       ` Arnd Bergmann
  2018-03-28  7:42                                                         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-28  7:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Alexander Duyck, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote:
>> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>> That's why in/out were *so* slow, and why nobody uses them any more
>> (well, the address size limitations and the lack of any remapping of
>> the address obviously also are a reason).
>
> All true indeed, though a lot of other archs never quite made them
> fully synchronous, which was another can of worms ... oh well.

Many architectures have no way of providing PCI compliant semantics
for outb, as their instruction set and/or bus interconnect lacks a
method of waiting for completion of an outb.

In practice, it doesn't seem to matter for any of the devices one would
encounter these days: very few use I/O space, and those that do don't
actually rely on the strict ordering. Some architectures (in particular
s390, but I remember seeing the same thing elsewhere) explicitly
disallow I/O space access on PCI because of this. On ARM, the typical
PCI implementations have other problems that are worse than this
one, so most drivers are fine with the almost-working semantics.

        Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  7:11                                                       ` Arnd Bergmann
@ 2018-03-28  7:42                                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  7:42 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Linus Torvalds, Alexander Duyck, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, 2018-03-28 at 09:11 +0200, Arnd Bergmann wrote:
> On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote:
> > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> > > That's why in/out were *so* slow, and why nobody uses them any more
> > > (well, the address size limitations and the lack of any remapping of
> > > the address obviously also are a reason).
> > 
> > All true indeed, though a lot of other archs never quite made them
> > fully synchronous, which was another can of worms ... oh well.
> 
> Many architectures have no way of providing PCI compliant semantics
> for outb, as their instruction set and/or bus interconnect lacks a
> method of waiting for completion of an outb.

Yup, that includes powerpc. Note that since POWER8 we don't even
genetate IO space anymore :-)

> In practice, it doesn't seem to matter for any of the devices one would
> encounter these days: very few use I/O space, and those that do don't
> actually rely on the strict ordering. Some architectures (in particular
> s390, but I remember seeing the same thing elsewhere) explicitly
> disallow I/O space access on PCI because of this. On ARM, the typical
> PCI implementations have other problems that are worse than this
> one, so most drivers are fine with the almost-working semantics.

/me cries...

Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 21:29                                                       ` Benjamin Herrenschmidt
@ 2018-03-28  8:53                                                         ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28  8:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > > in front of these.
> > 
> > I don't think so. My reading of memory-barriers.txt says that writeX might
> > expand to outX, and outX is not ordered with respect to other types of
> > memory.
> 
> Ugh ?
> 
> My understanding of HW at least is the exact opposite. outX is *more*
> ordered if anything, than any other accessors. IO space is completely
> synchronous, non posted and ordered afaik.

I'm just going by memory-barriers.txt:


 (*) inX(), outX():

     [...]

     They are guaranteed to be fully ordered with respect to each other.

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.


For arm/arm64 these end up behaving exactly the same as readX/writeX, but
I'm nervous about changing the documentation without understanding why it's
like it is currently. Maybe another ia64 thing?.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  8:53                                                         ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28  8:53 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote:
> > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb()
> > > in front of these.
> > 
> > I don't think so. My reading of memory-barriers.txt says that writeX might
> > expand to outX, and outX is not ordered with respect to other types of
> > memory.
> 
> Ugh ?
> 
> My understanding of HW at least is the exact opposite. outX is *more*
> ordered if anything, than any other accessors. IO space is completely
> synchronous, non posted and ordered afaik.

I'm just going by memory-barriers.txt:


 (*) inX(), outX():

     [...]

     They are guaranteed to be fully ordered with respect to each other.

     They are not guaranteed to be fully ordered with respect to other types of
     memory and I/O operation.


For arm/arm64 these end up behaving exactly the same as readX/writeX, but
I'm nervous about changing the documentation without understanding why it's
like it is currently. Maybe another ia64 thing?.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-28  8:53                                                         ` Will Deacon
@ 2018-03-28  9:00                                                           ` David Laight
  -1 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28  9:00 UTC (permalink / raw)
  To: 'Will Deacon', Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > expand to outX, and outX is not ordered with respect to other types of
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
> 
> I'm just going by memory-barriers.txt:
> 
> 
>  (*) inX(), outX():
> 
>      [...]
> 
>      They are guaranteed to be fully ordered with respect to each other.
> 
>      They are not guaranteed to be fully ordered with respect to other types of
>      memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb weren't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-28  9:00                                                           ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28  9:00 UTC (permalink / raw)
  To: 'Will Deacon', Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

From: Will Deacon
> Sent: 28 March 2018 09:54
...
> > > I don't think so. My reading of memory-barriers.txt says that writeX =
might
> > > expand to outX, and outX is not ordered with respect to other types o=
f
> > > memory.
> >
> > Ugh ?
> >
> > My understanding of HW at least is the exact opposite. outX is *more*
> > ordered if anything, than any other accessors. IO space is completely
> > synchronous, non posted and ordered afaik.
>=20
> I'm just going by memory-barriers.txt:
>=20
>=20
>  (*) inX(), outX():
>=20
>      [...]
>=20
>      They are guaranteed to be fully ordered with respect to each other.
>=20
>      They are not guaranteed to be fully ordered with respect to other ty=
pes of
>      memory and I/O operation.

A long time ago there was a document from Intel that said that inb/outb wer=
en't
necessarily synchronised wrt memory accesses.
(Might be P-pro era).
However no processors actually behaved that way and more recent docs
say that inb/outb are fully ordered.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:42                                                 ` Benjamin Herrenschmidt
  2018-03-28  6:53                                                     ` Linus Torvalds
@ 2018-03-28  9:07                                                   ` Will Deacon
  2018-03-28  9:56                                                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 216+ messages in thread
From: Will Deacon @ 2018-03-28  9:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote:
> > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt
> > <benh@kernel.crashing.org> wrote:
> > > 
> > > This is why, I want (with your agreement) to define clearly and once
> > > and for all, that the Linux semantics of writel are that it is ordered
> > > with previous writes to coherent memory (*)
> > 
> > Honestly, I think those are the sane semantics. In fact, make it
> > "ordered with previous writes" full stop, since it's not only ordered
> > wrt previous writes to memory, but also previous writel's.
> 
> Of course. It was somewhat a given that it's ordered vs. any previous
> MMIO actually, but it doesn't hurt to spell it out once more.

Good. So I think this confirms our understanding so far.

> 
> > > Also, can I assume the above ordering with writel() equally applies to
> > > readl() or not ?
> > > 
> > > IE:
> > >         dma_buf->foo = 1;
> > >         readl(STUPID_DEVICE_DMA_KICK_ON_READ);
> > 
> > If that KICK_ON_READ is UC, then that's definitely the case. And
> > honestly, status registers like that really should always be UC.
> > 
> > But if somebody sets the area WC (which is crazy), then I think it
> > might be at least debatable. x86 semantics does allow reads to be done
> > before previous writes (or, put another way, writes to be buffered -
> > the buffers are ordered so writes don't get re-ordered, but reads can
> > happen during the buffering).
> 
> Right, for now I worry about UC semantics. Once we have nailed that, we
> can look at WC, which is a lot more tricky as archs differs more
> widely, but one thing at a time.
> 
> > But UC accesses are always  done entirely ordered, and honestly, any
> > status register that starts a DMA would not make sense any other way.
> > 
> > Of course, you'd have to be pretty odd to want to start a DMA with a
> > read anyway - partly exactly because it's bad for performance since
> > reads will be synchronous and not buffered like a write).
> 
> I have bad memories of old adaptec controllers ...
> 
> That said, I think the above might not be right on ARM if we want to
> make it the rule, Will, what do you reckon ?

So there are two cases to consider:

1.
	if (readl(DEVICE_DMA_STATUS) == DMA_DONE)
		mydata = *dma_bufp;



2.
	*dma_bufp = 42;
	readl(DEVICE_DMA_KICK_ON_READ);


For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
add an mb() to make it work.

Do both of these work on power? If so, I guess I can make readl even more
expensive :/ Feels a bit like the tail wagging the dog, though.

Another thing I just realised is that we restrict the barriers we use in
readl/writel on arm64 so that they don't necessary apply to both loads and
stores. To be specific:

   writel is ordered against prior writes to memory, but not reads

   readl is ordered against subsequent reads of memory, but not writes (but
   note that in example (1) above, the control dependency ensures that).

If necessary, I could move the barrier in our readl implementation to be
before the read, then play the control-dependency + instruction-sync (ISB)
trick that you do on power.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  9:00                                                           ` David Laight
@ 2018-03-28  9:09                                                             ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28  9:09 UTC (permalink / raw)
  To: David Laight
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> From: Will Deacon
> > Sent: 28 March 2018 09:54
> ...
> > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > expand to outX, and outX is not ordered with respect to other types of
> > > > memory.
> > >
> > > Ugh ?
> > >
> > > My understanding of HW at least is the exact opposite. outX is *more*
> > > ordered if anything, than any other accessors. IO space is completely
> > > synchronous, non posted and ordered afaik.
> > 
> > I'm just going by memory-barriers.txt:
> > 
> > 
> >  (*) inX(), outX():
> > 
> >      [...]
> > 
> >      They are guaranteed to be fully ordered with respect to each other.
> > 
> >      They are not guaranteed to be fully ordered with respect to other types of
> >      memory and I/O operation.
> 
> A long time ago there was a document from Intel that said that inb/outb weren't
> necessarily synchronised wrt memory accesses.
> (Might be P-pro era).
> However no processors actually behaved that way and more recent docs
> say that inb/outb are fully ordered.

Thank you, David! I'll write another patch fixing this up and hopefully
we'll soon have one making writeX/readX much clearer.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  9:09                                                             ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28  9:09 UTC (permalink / raw)
  To: David Laight
  Cc: Benjamin Herrenschmidt, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> From: Will Deacon
> > Sent: 28 March 2018 09:54
> ...
> > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > expand to outX, and outX is not ordered with respect to other types of
> > > > memory.
> > >
> > > Ugh ?
> > >
> > > My understanding of HW at least is the exact opposite. outX is *more*
> > > ordered if anything, than any other accessors. IO space is completely
> > > synchronous, non posted and ordered afaik.
> > 
> > I'm just going by memory-barriers.txt:
> > 
> > 
> >  (*) inX(), outX():
> > 
> >      [...]
> > 
> >      They are guaranteed to be fully ordered with respect to each other.
> > 
> >      They are not guaranteed to be fully ordered with respect to other types of
> >      memory and I/O operation.
> 
> A long time ago there was a document from Intel that said that inb/outb weren't
> necessarily synchronised wrt memory accesses.
> (Might be P-pro era).
> However no processors actually behaved that way and more recent docs
> say that inb/outb are fully ordered.

Thank you, David! I'll write another patch fixing this up and hopefully
we'll soon have one making writeX/readX much clearer.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  8:53                                                         ` Will Deacon
@ 2018-03-28  9:50                                                           ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  9:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
> I'm nervous about changing the documentation without understanding why it's
> like it is currently. Maybe another ia64 thing?.

I doubt it ... the Intel ancestry here would make me think they are
completely ordered there too.

powerpc and ARM can't quite make them synchronous I think, but at least
they should have the same semantics as writel.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  9:50                                                           ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  9:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
> I'm nervous about changing the documentation without understanding why it's
> like it is currently. Maybe another ia64 thing?.

I doubt it ... the Intel ancestry here would make me think they are
completely ordered there too.

powerpc and ARM can't quite make them synchronous I think, but at least
they should have the same semantics as writel.

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  9:50                                                           ` Benjamin Herrenschmidt
@ 2018-03-28  9:55                                                             ` Arnd Bergmann
  -1 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-28  9:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Jonathan Corbet, linux-rdma, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
>> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
>> I'm nervous about changing the documentation without understanding why it's
>> like it is currently. Maybe another ia64 thing?.
>
> I doubt it ... the Intel ancestry here would make me think they are
> completely ordered there too.
>
> powerpc and ARM can't quite make them synchronous I think, but at least
> they should have the same semantics as writel.

One thing that ARM does IIRC is that it only guarantees to order writel() within
one device, and the memory mapped PCI I/O space window almost certainly
counts as a separate device to the CPU.

In the absence of an enforced global synchronization during an I/O port
access, that means writel() and outb() can be reordered before they arrive
at a device in theory. Again, this rarely matters in practice, but I think it
makes sense to document the less strict behavior here, given that we have
common hardware that can't provide x86 compatible semantics.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  9:55                                                             ` Arnd Bergmann
  0 siblings, 0 replies; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-28  9:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote:
>> For arm/arm64 these end up behaving exactly the same as readX/writeX, but
>> I'm nervous about changing the documentation without understanding why it's
>> like it is currently. Maybe another ia64 thing?.
>
> I doubt it ... the Intel ancestry here would make me think they are
> completely ordered there too.
>
> powerpc and ARM can't quite make them synchronous I think, but at least
> they should have the same semantics as writel.

One thing that ARM does IIRC is that it only guarantees to order writel() within
one device, and the memory mapped PCI I/O space window almost certainly
counts as a separate device to the CPU.

In the absence of an enforced global synchronization during an I/O port
access, that means writel() and outb() can be reordered before they arrive
at a device in theory. Again, this rarely matters in practice, but I think it
makes sense to document the less strict behavior here, given that we have
common hardware that can't provide x86 compatible semantics.

       Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  9:07                                                   ` Will Deacon
@ 2018-03-28  9:56                                                     ` Benjamin Herrenschmidt
  2018-03-28 10:13                                                       ` Aw: " Lino Sanfilippo
  2018-03-28 11:30                                                         ` David Laight
  0 siblings, 2 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  9:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote:
> 
> For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to
> add an mb() to make it work.
> 
> Do both of these work on power? 

Yes. There's even another quirk, see further down ;-)

> If so, I guess I can make readl even more
> expensive :/ Feels a bit like the tail wagging the dog, though.

Maybe, but then readl is always horribly slow anyway so you may not
necessarily be losing that much.

> Another thing I just realised is that we restrict the barriers we use in
> readl/writel on arm64 so that they don't necessary apply to both loads and
> stores. To be specific:
> 
>    writel is ordered against prior writes to memory, but not reads

That could be tricky... You may end up with something that reads before
triggering a DMA and ends up with the post-DMA value ... ugh.

>    readl is ordered against subsequent reads of memory, but not writes (but
>    note that in example (1) above, the control dependency ensures that).
> 
> If necessary, I could move the barrier in our readl implementation to be
> before the read, then play the control-dependency + instruction-sync (ISB)
> trick that you do on power.

Yeah so that other trick I'm talking about is also used for timing
accuracy.

For example, let's say I have a device with a reset bit and the spec
says the reset bit needs to be set for at least 10us.

This is wrong:

	writel(1, RESET_REG);
	usleep(10);
	writel(0, RESET_REG);

Because of write posting, the first write might arrive to the device
right before the second one.

The typical "fix" is to turn that into:

	writel(1, RESET_REG);
	readl(RESET_REG); /* Flush posted writes */
	usleep(10);
	writel(0, RESET_REG);

*However* the issue here, at least on power, is that the CPU can issue
that readl but doesn't necessarily wait for it to complete (ie, the
data to return), before proceeding to the usleep. Now a usleep contains
a bunch of loads and stores and is probably fine, but a udelay which
just loops on the timebase may not be.

Thus we may still violate the timing requirement.

What we did inside readl, with the twi;isync sequence (which basically
means, trap on return value with "trap never" as a condition, followed
by isync that ensures all excpetion conditions are resolved), is force
the CPU to "consume" the data from the read before moving on.

This effectively makes readl fully synchronous (we would probably avoid
that if we were to implement a readl_relaxed).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  9:09                                                             ` Will Deacon
@ 2018-03-28  9:56                                                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  9:56 UTC (permalink / raw)
  To: Will Deacon, David Laight
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> > From: Will Deacon
> > > Sent: 28 March 2018 09:54
> > 
> > ...
> > > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > > expand to outX, and outX is not ordered with respect to other types of
> > > > > memory.
> > > > 
> > > > Ugh ?
> > > > 
> > > > My understanding of HW at least is the exact opposite. outX is *more*
> > > > ordered if anything, than any other accessors. IO space is completely
> > > > synchronous, non posted and ordered afaik.
> > > 
> > > I'm just going by memory-barriers.txt:
> > > 
> > > 
> > >  (*) inX(), outX():
> > > 
> > >      [...]
> > > 
> > >      They are guaranteed to be fully ordered with respect to each other.
> > > 
> > >      They are not guaranteed to be fully ordered with respect to other types of
> > >      memory and I/O operation.
> > 
> > A long time ago there was a document from Intel that said that inb/outb weren't
> > necessarily synchronised wrt memory accesses.
> > (Might be P-pro era).
> > However no processors actually behaved that way and more recent docs
> > say that inb/outb are fully ordered.
> 
> Thank you, David! I'll write another patch fixing this up and hopefully
> we'll soon have one making writeX/readX much clearer.

Thanks for doing the grunt work Will ! :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28  9:56                                                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28  9:56 UTC (permalink / raw)
  To: Will Deacon, David Laight
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote:
> > From: Will Deacon
> > > Sent: 28 March 2018 09:54
> > 
> > ...
> > > > > I don't think so. My reading of memory-barriers.txt says that writeX might
> > > > > expand to outX, and outX is not ordered with respect to other types of
> > > > > memory.
> > > > 
> > > > Ugh ?
> > > > 
> > > > My understanding of HW at least is the exact opposite. outX is *more*
> > > > ordered if anything, than any other accessors. IO space is completely
> > > > synchronous, non posted and ordered afaik.
> > > 
> > > I'm just going by memory-barriers.txt:
> > > 
> > > 
> > >  (*) inX(), outX():
> > > 
> > >      [...]
> > > 
> > >      They are guaranteed to be fully ordered with respect to each other.
> > > 
> > >      They are not guaranteed to be fully ordered with respect to other types of
> > >      memory and I/O operation.
> > 
> > A long time ago there was a document from Intel that said that inb/outb weren't
> > necessarily synchronised wrt memory accesses.
> > (Might be P-pro era).
> > However no processors actually behaved that way and more recent docs
> > say that inb/outb are fully ordered.
> 
> Thank you, David! I'll write another patch fixing this up and hopefully
> we'll soon have one making writeX/readX much clearer.

Thanks for doing the grunt work Will ! :-)

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  9:55                                                             ` Arnd Bergmann
@ 2018-03-28 10:01                                                               ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 10:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Jonathan Corbet, linux-rdma, Will Deacon, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > powerpc and ARM can't quite make them synchronous I think, but at least
> > they should have the same semantics as writel.
> 
> One thing that ARM does IIRC is that it only guarantees to order writel() within
> one device, and the memory mapped PCI I/O space window almost certainly
> counts as a separate device to the CPU.

That sounds bogus.

> In the absence of an enforced global synchronization during an I/O port
> access, that means writel() and outb() can be reordered before they arrive
> at a device in theory. Again, this rarely matters in practice, but I think it
> makes sense to document the less strict behavior here, given that we have
> common hardware that can't provide x86 compatible semantics.

Can't you put some kind of super heavy handed barrier in inX/outX ?
These things are never going to be performance sensitive anyway...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28 10:01                                                               ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 10:01 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > powerpc and ARM can't quite make them synchronous I think, but at least
> > they should have the same semantics as writel.
> 
> One thing that ARM does IIRC is that it only guarantees to order writel() within
> one device, and the memory mapped PCI I/O space window almost certainly
> counts as a separate device to the CPU.

That sounds bogus.

> In the absence of an enforced global synchronization during an I/O port
> access, that means writel() and outb() can be reordered before they arrive
> at a device in theory. Again, this rarely matters in practice, but I think it
> makes sense to document the less strict behavior here, given that we have
> common hardware that can't provide x86 compatible semantics.

Can't you put some kind of super heavy handed barrier in inX/outX ?
These things are never going to be performance sensitive anyway...

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Aw: Re: RFC on writel and writel_relaxed
  2018-03-28  9:56                                                     ` Benjamin Herrenschmidt
@ 2018-03-28 10:13                                                       ` Lino Sanfilippo
  2018-03-28 10:20                                                         ` Benjamin Herrenschmidt
  2018-03-28 11:30                                                         ` David Laight
  1 sibling, 1 reply; 216+ messages in thread
From: Lino Sanfilippo @ 2018-03-28 10:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, Linus Torvalds, Alexander Duyck, Sinan Kaya,
	Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

Hi,


> 
> Yeah so that other trick I'm talking about is also used for timing
> accuracy.
> 
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
> 	writel(1, RESET_REG);
> 	usleep(10);
> 	writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 

Does not write posting only concern PCI? This seems to be a different topic. Furthermore
write posting should not include write reordering...

Regards,
Lino

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 10:01                                                               ` Benjamin Herrenschmidt
@ 2018-03-28 10:13                                                                 ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28 10:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > they should have the same semantics as writel.
> > 
> > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > one device, and the memory mapped PCI I/O space window almost certainly
> > counts as a separate device to the CPU.
> 
> That sounds bogus.

To elaborate, if you do the following on arm:

	writel(DEVICE_FOO);
	writel(DEVICE_BAR);

we generally cannot guarantee in which order those accesses will hit the
devices even if we add every barrier under the sun. You'd need something
in between, specific to DEVICE_FOO (probably a read-back) to really push
the first write out. This doesn't sound like it would be that uncommon to
me.

On the other hand:

	writel(DEVICE_FOO);
	writel(DEVICE_FOO);

is obviously ordered and also things like:

	writel(DEVICE_FOO_IN_PCI_MEM_SPACE);
	writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE);

are ordered up to the PCI host bridge, because that's really the "device"
here.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28 10:13                                                                 ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-28 10:13 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > they should have the same semantics as writel.
> > 
> > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > one device, and the memory mapped PCI I/O space window almost certainly
> > counts as a separate device to the CPU.
> 
> That sounds bogus.

To elaborate, if you do the following on arm:

	writel(DEVICE_FOO);
	writel(DEVICE_BAR);

we generally cannot guarantee in which order those accesses will hit the
devices even if we add every barrier under the sun. You'd need something
in between, specific to DEVICE_FOO (probably a read-back) to really push
the first write out. This doesn't sound like it would be that uncommon to
me.

On the other hand:

	writel(DEVICE_FOO);
	writel(DEVICE_FOO);

is obviously ordered and also things like:

	writel(DEVICE_FOO_IN_PCI_MEM_SPACE);
	writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE);

are ordered up to the PCI host bridge, because that's really the "device"
here.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: Aw: Re: RFC on writel and writel_relaxed
  2018-03-28 10:13                                                       ` Aw: " Lino Sanfilippo
@ 2018-03-28 10:20                                                         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 10:20 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Will Deacon, Linus Torvalds, Alexander Duyck, Sinan Kaya,
	Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, 2018-03-28 at 12:13 +0200, Lino Sanfilippo wrote:
> Hi,
> 
> 
> > 
> > Yeah so that other trick I'm talking about is also used for timing
> > accuracy.
> > 
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > 	writel(1, RESET_REG);
> > 	usleep(10);
> > 	writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> 
> Does not write posting only concern PCI? This seems to be a different topic. Furthermore
> write posting should not include write reordering...

Nobody's talking about re-ordering and no, write posting is rather
common practice on a whole lot of different busses, not just PCI(e).

Cheers,
Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-28  9:56                                                     ` Benjamin Herrenschmidt
@ 2018-03-28 11:30                                                         ` David Laight
  2018-03-28 11:30                                                         ` David Laight
  1 sibling, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28 11:30 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

From: Benjamin Herrenschmidt
> Sent: 28 March 2018 10:56
...
> For example, let's say I have a device with a reset bit and the spec
> says the reset bit needs to be set for at least 10us.
> 
> This is wrong:
> 
> 	writel(1, RESET_REG);
> 	usleep(10);
> 	writel(0, RESET_REG);
> 
> Because of write posting, the first write might arrive to the device
> right before the second one.
> 
> The typical "fix" is to turn that into:
> 
> 	writel(1, RESET_REG);
> 	readl(RESET_REG); /* Flush posted writes */

Would a writel(1, RESET_REG) here provide enough synchronsiation?

> 	usleep(10);
> 	writel(0, RESET_REG);
> 
> *However* the issue here, at least on power, is that the CPU can issue
> that readl but doesn't necessarily wait for it to complete (ie, the
> data to return), before proceeding to the usleep. Now a usleep contains
> a bunch of loads and stores and is probably fine, but a udelay which
> just loops on the timebase may not be.
> 
> Thus we may still violate the timing requirement.

I've seem that sort of code (with udelay() and no read back) quite often.
How many were in linux I don't know.

For small delays I usually fix it by repeated writes (of the same value)
to the device register. That can guarantee very short intervals.

The only time I've actually seen buffered writes break timing was
between a 286 and an 8859 interrupt controller.
If you wrote to the mask then enabled interrupts the first IACK cycle
could be too close to write and break the cycle recovery time.
That clobbered most of the interrupt controller registers.
That probably affected every 286 board ever built!
Not sure how much software added the required extra bus cycle.

> What we did inside readl, with the twi;isync sequence (which basically
> means, trap on return value with "trap never" as a condition, followed
> by isync that ensures all excpetion conditions are resolved), is force
> the CPU to "consume" the data from the read before moving on.
> 
> This effectively makes readl fully synchronous (we would probably avoid
> that if we were to implement a readl_relaxed).

I've always wondered exactly what the twi;isync were for - always seemed
very heavy handed for most mmio reads.
Particularly if you are doing mmio reads from a data fifo.

Perhaps there should be a writel_wait() that is allowed to do a read back
for such code paths?

	David


^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-28 11:30                                                         ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28 11:30 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

RnJvbTogQmVuamFtaW4gSGVycmVuc2NobWlkdA0KPiBTZW50OiAyOCBNYXJjaCAyMDE4IDEwOjU2
DQouLi4NCj4gRm9yIGV4YW1wbGUsIGxldCdzIHNheSBJIGhhdmUgYSBkZXZpY2Ugd2l0aCBhIHJl
c2V0IGJpdCBhbmQgdGhlIHNwZWMNCj4gc2F5cyB0aGUgcmVzZXQgYml0IG5lZWRzIHRvIGJlIHNl
dCBmb3IgYXQgbGVhc3QgMTB1cy4NCj4gDQo+IFRoaXMgaXMgd3Jvbmc6DQo+IA0KPiAJd3JpdGVs
KDEsIFJFU0VUX1JFRyk7DQo+IAl1c2xlZXAoMTApOw0KPiAJd3JpdGVsKDAsIFJFU0VUX1JFRyk7
DQo+IA0KPiBCZWNhdXNlIG9mIHdyaXRlIHBvc3RpbmcsIHRoZSBmaXJzdCB3cml0ZSBtaWdodCBh
cnJpdmUgdG8gdGhlIGRldmljZQ0KPiByaWdodCBiZWZvcmUgdGhlIHNlY29uZCBvbmUuDQo+IA0K
PiBUaGUgdHlwaWNhbCAiZml4IiBpcyB0byB0dXJuIHRoYXQgaW50bzoNCj4gDQo+IAl3cml0ZWwo
MSwgUkVTRVRfUkVHKTsNCj4gCXJlYWRsKFJFU0VUX1JFRyk7IC8qIEZsdXNoIHBvc3RlZCB3cml0
ZXMgKi8NCg0KV291bGQgYSB3cml0ZWwoMSwgUkVTRVRfUkVHKSBoZXJlIHByb3ZpZGUgZW5vdWdo
IHN5bmNocm9uc2lhdGlvbj8NCg0KPiAJdXNsZWVwKDEwKTsNCj4gCXdyaXRlbCgwLCBSRVNFVF9S
RUcpOw0KPiANCj4gKkhvd2V2ZXIqIHRoZSBpc3N1ZSBoZXJlLCBhdCBsZWFzdCBvbiBwb3dlciwg
aXMgdGhhdCB0aGUgQ1BVIGNhbiBpc3N1ZQ0KPiB0aGF0IHJlYWRsIGJ1dCBkb2Vzbid0IG5lY2Vz
c2FyaWx5IHdhaXQgZm9yIGl0IHRvIGNvbXBsZXRlIChpZSwgdGhlDQo+IGRhdGEgdG8gcmV0dXJu
KSwgYmVmb3JlIHByb2NlZWRpbmcgdG8gdGhlIHVzbGVlcC4gTm93IGEgdXNsZWVwIGNvbnRhaW5z
DQo+IGEgYnVuY2ggb2YgbG9hZHMgYW5kIHN0b3JlcyBhbmQgaXMgcHJvYmFibHkgZmluZSwgYnV0
IGEgdWRlbGF5IHdoaWNoDQo+IGp1c3QgbG9vcHMgb24gdGhlIHRpbWViYXNlIG1heSBub3QgYmUu
DQo+IA0KPiBUaHVzIHdlIG1heSBzdGlsbCB2aW9sYXRlIHRoZSB0aW1pbmcgcmVxdWlyZW1lbnQu
DQoNCkkndmUgc2VlbSB0aGF0IHNvcnQgb2YgY29kZSAod2l0aCB1ZGVsYXkoKSBhbmQgbm8gcmVh
ZCBiYWNrKSBxdWl0ZSBvZnRlbi4NCkhvdyBtYW55IHdlcmUgaW4gbGludXggSSBkb24ndCBrbm93
Lg0KDQpGb3Igc21hbGwgZGVsYXlzIEkgdXN1YWxseSBmaXggaXQgYnkgcmVwZWF0ZWQgd3JpdGVz
IChvZiB0aGUgc2FtZSB2YWx1ZSkNCnRvIHRoZSBkZXZpY2UgcmVnaXN0ZXIuIFRoYXQgY2FuIGd1
YXJhbnRlZSB2ZXJ5IHNob3J0IGludGVydmFscy4NCg0KVGhlIG9ubHkgdGltZSBJJ3ZlIGFjdHVh
bGx5IHNlZW4gYnVmZmVyZWQgd3JpdGVzIGJyZWFrIHRpbWluZyB3YXMNCmJldHdlZW4gYSAyODYg
YW5kIGFuIDg4NTkgaW50ZXJydXB0IGNvbnRyb2xsZXIuDQpJZiB5b3Ugd3JvdGUgdG8gdGhlIG1h
c2sgdGhlbiBlbmFibGVkIGludGVycnVwdHMgdGhlIGZpcnN0IElBQ0sgY3ljbGUNCmNvdWxkIGJl
IHRvbyBjbG9zZSB0byB3cml0ZSBhbmQgYnJlYWsgdGhlIGN5Y2xlIHJlY292ZXJ5IHRpbWUuDQpU
aGF0IGNsb2JiZXJlZCBtb3N0IG9mIHRoZSBpbnRlcnJ1cHQgY29udHJvbGxlciByZWdpc3RlcnMu
DQpUaGF0IHByb2JhYmx5IGFmZmVjdGVkIGV2ZXJ5IDI4NiBib2FyZCBldmVyIGJ1aWx0IQ0KTm90
IHN1cmUgaG93IG11Y2ggc29mdHdhcmUgYWRkZWQgdGhlIHJlcXVpcmVkIGV4dHJhIGJ1cyBjeWNs
ZS4NCg0KPiBXaGF0IHdlIGRpZCBpbnNpZGUgcmVhZGwsIHdpdGggdGhlIHR3aTtpc3luYyBzZXF1
ZW5jZSAod2hpY2ggYmFzaWNhbGx5DQo+IG1lYW5zLCB0cmFwIG9uIHJldHVybiB2YWx1ZSB3aXRo
ICJ0cmFwIG5ldmVyIiBhcyBhIGNvbmRpdGlvbiwgZm9sbG93ZWQNCj4gYnkgaXN5bmMgdGhhdCBl
bnN1cmVzIGFsbCBleGNwZXRpb24gY29uZGl0aW9ucyBhcmUgcmVzb2x2ZWQpLCBpcyBmb3JjZQ0K
PiB0aGUgQ1BVIHRvICJjb25zdW1lIiB0aGUgZGF0YSBmcm9tIHRoZSByZWFkIGJlZm9yZSBtb3Zp
bmcgb24uDQo+IA0KPiBUaGlzIGVmZmVjdGl2ZWx5IG1ha2VzIHJlYWRsIGZ1bGx5IHN5bmNocm9u
b3VzICh3ZSB3b3VsZCBwcm9iYWJseSBhdm9pZA0KPiB0aGF0IGlmIHdlIHdlcmUgdG8gaW1wbGVt
ZW50IGEgcmVhZGxfcmVsYXhlZCkuDQoNCkkndmUgYWx3YXlzIHdvbmRlcmVkIGV4YWN0bHkgd2hh
dCB0aGUgdHdpO2lzeW5jIHdlcmUgZm9yIC0gYWx3YXlzIHNlZW1lZA0KdmVyeSBoZWF2eSBoYW5k
ZWQgZm9yIG1vc3QgbW1pbyByZWFkcy4NClBhcnRpY3VsYXJseSBpZiB5b3UgYXJlIGRvaW5nIG1t
aW8gcmVhZHMgZnJvbSBhIGRhdGEgZmlmby4NCg0KUGVyaGFwcyB0aGVyZSBzaG91bGQgYmUgYSB3
cml0ZWxfd2FpdCgpIHRoYXQgaXMgYWxsb3dlZCB0byBkbyBhIHJlYWQgYmFjaw0KZm9yIHN1Y2gg
Y29kZSBwYXRocz8NCg0KCURhdmlkDQoNCg==

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28  6:14                                               ` Linus Torvalds
@ 2018-03-28 11:41                                                 ` okaya
  2018-03-28 15:13                                                   ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 216+ messages in thread
From: okaya @ 2018-03-28 11:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Alexander Duyck, Will Deacon,
	Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, linus971

On 2018-03-28 02:14, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya <okaya@codeaurora.org> 
> wrote:
>> 
>> Basically changing it to
>> 
>> dma_buffer->foo = 1;                    /* WB */
>> wmb()
>> writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
>> mmiowb()
> 
> Why?
> 
> Why not  just remove the wmb(), and keep the barrier in the writel()?

Yes, we want to get there indeed. It is because of some arch not 
implementing writel properly. Maintainers want to play safe.

That is why I asked if IA64 and other well known archs follow the 
strongly ordered rule at this moment like PPC and ARM.

Or should we go and inform every arch about this before yanking wmb()?

Maintainers are afraid of introducing a regression.

> 
> The above code makes no sense, and just looks stupid to me. It also
> generates pointlessly bad code on x86, so it's bad there too.
> 
>                Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 11:30                                                         ` David Laight
  (?)
@ 2018-03-28 15:12                                                         ` Benjamin Herrenschmidt
  2018-03-28 16:16                                                             ` David Laight
  -1 siblings, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 15:12 UTC (permalink / raw)
  To: David Laight, Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

On Wed, 2018-03-28 at 11:30 +0000, David Laight wrote:
> From: Benjamin Herrenschmidt
> > Sent: 28 March 2018 10:56
> 
> ...
> > For example, let's say I have a device with a reset bit and the spec
> > says the reset bit needs to be set for at least 10us.
> > 
> > This is wrong:
> > 
> > 	writel(1, RESET_REG);
> > 	usleep(10);
> > 	writel(0, RESET_REG);
> > 
> > Because of write posting, the first write might arrive to the device
> > right before the second one.
> > 
> > The typical "fix" is to turn that into:
> > 
> > 	writel(1, RESET_REG);
> > 	readl(RESET_REG); /* Flush posted writes */
> 
> Would a writel(1, RESET_REG) here provide enough synchronsiation?

Probably yes. It's one of those things where you try to deal with the
fact that 90% of driver writers barely understand the basic stuff and
so you need the "default" accessors to be hardened as much as possible.

We still need to get a reasonably definition of the semantics of the
relaxed ones vs. WC memory but let's get through that exercise first
and hopefully for the last time.
 
> > 	usleep(10);
> > 	writel(0, RESET_REG);
> > 
> > *However* the issue here, at least on power, is that the CPU can issue
> > that readl but doesn't necessarily wait for it to complete (ie, the
> > data to return), before proceeding to the usleep. Now a usleep contains
> > a bunch of loads and stores and is probably fine, but a udelay which
> > just loops on the timebase may not be.
> > 
> > Thus we may still violate the timing requirement.
> 
> I've seem that sort of code (with udelay() and no read back) quite often.
> How many were in linux I don't know.
> 
> For small delays I usually fix it by repeated writes (of the same value)
> to the device register. That can guarantee very short intervals.

As long as you know the bus frequency...

> The only time I've actually seen buffered writes break timing was
> between a 286 and an 8859 interrupt controller.

:-)

The problem for me is not so much what I've seen, I realize that most
of the issues we are talking about are the kind that will hit once in a
thousand times or less.

But we *can* reason about them in a way that can effectively prevent
the problem completely and when your cluster has 10000 machine, 1/1000
starts becoming significant.

These days the vast majority of IO devices either are 99% DMA driven so
that a bit of overhead on MMIOs is irrelevant, or have one fast path
(low latency IB etc...) that needs some finer control, and the rest is
all setup which can be paranoid at will.

So I think we should aim for the "default" accessors most people use to
be as hadened as we can think of. I favor correctness over performance
in all cases. But then we also define a reasonable semantic for the
relaxed ones (well, we sort-of do have one, we might have to make it a
bit more precise in some areas) that allows the few MMIO fast path that
care to be optimized.

> If you wrote to the mask then enabled interrupts the first IACK cycle
> could be too close to write and break the cycle recovery time.
> That clobbered most of the interrupt controller registers.
> That probably affected every 286 board ever built!
> Not sure how much software added the required extra bus cycle.
> 
> > What we did inside readl, with the twi;isync sequence (which basically
> > means, trap on return value with "trap never" as a condition, followed
> > by isync that ensures all excpetion conditions are resolved), is force
> > the CPU to "consume" the data from the read before moving on.
> > 
> > This effectively makes readl fully synchronous (we would probably avoid
> > that if we were to implement a readl_relaxed).
> 
> I've always wondered exactly what the twi;isync were for - always seemed
> very heavy handed for most mmio reads.
> Particularly if you are doing mmio reads from a data fifo.

If you do that you should use the "s" version of the accessors. Those
will only do the above trick at the end of the access series. Also a
FIFO needs special care about endianness anyway, so you should use
those accessors regardless. (Hint: you never endian swap a FIFO even on
BE on a LE device, unless something's been wired very badly in HW).

> Perhaps there should be a writel_wait() that is allowed to do a read back
> for such code paths?

I think what we have is fine, we just define that the standard
writel/readl as fairly simple and hardened, and we look at providing a
somewhat reasonable set of relaxed variants for optimizing fast path.
We pretty much already are there, we just need to be better at defining
the semantics.

And for the super high perf case, which thankfully is either seldom
(server high perf network stuff) or very arch specific (ARM SoC stuff),
then arch specific driver hacks will always remain the norm.

Cheers,
Ben. 

> 	David
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 11:41                                                 ` okaya
@ 2018-03-28 15:13                                                   ` Benjamin Herrenschmidt
  2018-03-28 15:55                                                     ` David Miller
  0 siblings, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 15:13 UTC (permalink / raw)
  To: okaya, Linus Torvalds
  Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe,
	David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, linus971

On Wed, 2018-03-28 at 07:41 -0400, okaya@codeaurora.org wrote:
> Yes, we want to get there indeed. It is because of some arch not 
> implementing writel properly. Maintainers want to play safe.
> 
> That is why I asked if IA64 and other well known archs follow the 
> strongly ordered rule at this moment like PPC and ARM.
> 
> Or should we go and inform every arch about this before yanking wmb()?
> 
> Maintainers are afraid of introducing a regression.

Let's fix all archs, it's way easier than fixing all drivers. Half of
the archs are unused or dead anyway.

> > 
> > The above code makes no sense, and just looks stupid to me. It also
> > generates pointlessly bad code on x86, so it's bad there too.
> > 
> >                 Linus

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 15:13                                                   ` Benjamin Herrenschmidt
@ 2018-03-28 15:55                                                     ` David Miller
  2018-03-28 16:23                                                       ` Nicholas Piggin
  2018-03-29 13:56                                                       ` Sinan Kaya
  0 siblings, 2 replies; 216+ messages in thread
From: David Miller @ 2018-03-28 15:55 UTC (permalink / raw)
  To: benh
  Cc: okaya, torvalds, alexander.duyck, will.deacon, arnd, jgg,
	David.Laight, oohall, linuxppc-dev, linux-rdma,
	alexander.h.duyck, paulmck, netdev, linus971

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Thu, 29 Mar 2018 02:13:16 +1100

> Let's fix all archs, it's way easier than fixing all drivers. Half of
> the archs are unused or dead anyway.

Agreed.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-28 15:12                                                         ` Benjamin Herrenschmidt
@ 2018-03-28 16:16                                                             ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28 16:16 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

From: Benjamin Herrenschmidt
> Sent: 28 March 2018 16:13
...
> > I've always wondered exactly what the twi;isync were for - always seemed
> > very heavy handed for most mmio reads.
> > Particularly if you are doing mmio reads from a data fifo.
> 
> If you do that you should use the "s" version of the accessors. Those
> will only do the above trick at the end of the access series. Also a
> FIFO needs special care about endianness anyway, so you should use
> those accessors regardless. (Hint: you never endian swap a FIFO even on
> BE on a LE device, unless something's been wired very badly in HW).

That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface.
Reading the high address 'clocked' the fifo.
So the first 3 reads could happen in any order, but the 4th had to be last.
This is a small ppc and we shovel a lot of data through that fifo.

Whether it needed byteswapping depended completely on how our hardware people
had built the pcb (not made easy by some docs using the ibm bit numbering).
In fact it didn't....

While that driver only had to run on a very specific small ppc, generic drivers
might have similar issues.

I suspect that writel() is always (or should always be):
	barrier_before_writel()
	writel_relaxed()
	barrier_after_writel()
So if a driver needs to do multiple writes (without strong ordering)
it should be able to repeat the writel_relaxed() with only one set
of barriers.
Similarly for readl().
In addition a lesser barrier is probably enough between a readl_relaxed()
and a writel_relaxed() that is conditional on the read value.

	David


^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-28 16:16                                                             ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-28 16:16 UTC (permalink / raw)
  To: 'Benjamin Herrenschmidt', Will Deacon
  Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann,
	Jason Gunthorpe, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, netdev

RnJvbTogQmVuamFtaW4gSGVycmVuc2NobWlkdA0KPiBTZW50OiAyOCBNYXJjaCAyMDE4IDE2OjEz
DQouLi4NCj4gPiBJJ3ZlIGFsd2F5cyB3b25kZXJlZCBleGFjdGx5IHdoYXQgdGhlIHR3aTtpc3lu
YyB3ZXJlIGZvciAtIGFsd2F5cyBzZWVtZWQNCj4gPiB2ZXJ5IGhlYXZ5IGhhbmRlZCBmb3IgbW9z
dCBtbWlvIHJlYWRzLg0KPiA+IFBhcnRpY3VsYXJseSBpZiB5b3UgYXJlIGRvaW5nIG1taW8gcmVh
ZHMgZnJvbSBhIGRhdGEgZmlmby4NCj4gDQo+IElmIHlvdSBkbyB0aGF0IHlvdSBzaG91bGQgdXNl
IHRoZSAicyIgdmVyc2lvbiBvZiB0aGUgYWNjZXNzb3JzLiBUaG9zZQ0KPiB3aWxsIG9ubHkgZG8g
dGhlIGFib3ZlIHRyaWNrIGF0IHRoZSBlbmQgb2YgdGhlIGFjY2VzcyBzZXJpZXMuIEFsc28gYQ0K
PiBGSUZPIG5lZWRzIHNwZWNpYWwgY2FyZSBhYm91dCBlbmRpYW5uZXNzIGFueXdheSwgc28geW91
IHNob3VsZCB1c2UNCj4gdGhvc2UgYWNjZXNzb3JzIHJlZ2FyZGxlc3MuIChIaW50OiB5b3UgbmV2
ZXIgZW5kaWFuIHN3YXAgYSBGSUZPIGV2ZW4gb24NCj4gQkUgb24gYSBMRSBkZXZpY2UsIHVubGVz
cyBzb21ldGhpbmcncyBiZWVuIHdpcmVkIHZlcnkgYmFkbHkgaW4gSFcpLg0KDQpUaGF0IHdhcyBh
Y3R1YWxseSBhIDY0IGJpdCB3aWRlIGZpZm8gY29ubmVjdGVkIHRvIGEgMTZiaXQgd2lkZSBQSU8g
aW50ZXJmYWNlLg0KUmVhZGluZyB0aGUgaGlnaCBhZGRyZXNzICdjbG9ja2VkJyB0aGUgZmlmby4N
ClNvIHRoZSBmaXJzdCAzIHJlYWRzIGNvdWxkIGhhcHBlbiBpbiBhbnkgb3JkZXIsIGJ1dCB0aGUg
NHRoIGhhZCB0byBiZSBsYXN0Lg0KVGhpcyBpcyBhIHNtYWxsIHBwYyBhbmQgd2Ugc2hvdmVsIGEg
bG90IG9mIGRhdGEgdGhyb3VnaCB0aGF0IGZpZm8uDQoNCldoZXRoZXIgaXQgbmVlZGVkIGJ5dGVz
d2FwcGluZyBkZXBlbmRlZCBjb21wbGV0ZWx5IG9uIGhvdyBvdXIgaGFyZHdhcmUgcGVvcGxlDQpo
YWQgYnVpbHQgdGhlIHBjYiAobm90IG1hZGUgZWFzeSBieSBzb21lIGRvY3MgdXNpbmcgdGhlIGli
bSBiaXQgbnVtYmVyaW5nKS4NCkluIGZhY3QgaXQgZGlkbid0Li4uLg0KDQpXaGlsZSB0aGF0IGRy
aXZlciBvbmx5IGhhZCB0byBydW4gb24gYSB2ZXJ5IHNwZWNpZmljIHNtYWxsIHBwYywgZ2VuZXJp
YyBkcml2ZXJzDQptaWdodCBoYXZlIHNpbWlsYXIgaXNzdWVzLg0KDQpJIHN1c3BlY3QgdGhhdCB3
cml0ZWwoKSBpcyBhbHdheXMgKG9yIHNob3VsZCBhbHdheXMgYmUpOg0KCWJhcnJpZXJfYmVmb3Jl
X3dyaXRlbCgpDQoJd3JpdGVsX3JlbGF4ZWQoKQ0KCWJhcnJpZXJfYWZ0ZXJfd3JpdGVsKCkNClNv
IGlmIGEgZHJpdmVyIG5lZWRzIHRvIGRvIG11bHRpcGxlIHdyaXRlcyAod2l0aG91dCBzdHJvbmcg
b3JkZXJpbmcpDQppdCBzaG91bGQgYmUgYWJsZSB0byByZXBlYXQgdGhlIHdyaXRlbF9yZWxheGVk
KCkgd2l0aCBvbmx5IG9uZSBzZXQNCm9mIGJhcnJpZXJzLg0KU2ltaWxhcmx5IGZvciByZWFkbCgp
Lg0KSW4gYWRkaXRpb24gYSBsZXNzZXIgYmFycmllciBpcyBwcm9iYWJseSBlbm91Z2ggYmV0d2Vl
biBhIHJlYWRsX3JlbGF4ZWQoKQ0KYW5kIGEgd3JpdGVsX3JlbGF4ZWQoKSB0aGF0IGlzIGNvbmRp
dGlvbmFsIG9uIHRoZSByZWFkIHZhbHVlLg0KDQoJRGF2aWQNCg0K

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 15:55                                                     ` David Miller
@ 2018-03-28 16:23                                                       ` Nicholas Piggin
  2018-03-28 21:31                                                         ` Benjamin Herrenschmidt
  2018-03-29 13:56                                                       ` Sinan Kaya
  1 sibling, 1 reply; 216+ messages in thread
From: Nicholas Piggin @ 2018-03-28 16:23 UTC (permalink / raw)
  To: David Miller
  Cc: benh, paulmck, arnd, linux-rdma, linuxppc-dev, linus971,
	will.deacon, alexander.duyck, okaya, jgg, David.Laight, oohall,
	netdev, alexander.h.duyck, torvalds

On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
> > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > the archs are unused or dead anyway.  
> 
> Agreed.

While we're making decrees here, can we do something about mmiowb?
The semantics are basically indecipherable.

  This is a variation on the mandatory write barrier that causes writes to weakly
  ordered I/O regions to be partially ordered.  Its effects may go beyond the
  CPU->Hardware interface and actually affect the hardware at some level.

How can a driver writer possibly get that right?

IIRC it was added for some big ia64 system that was really expensive
to implement the proper wmb() semantics on. So wmb() semantics were
quietly downgraded, then the subsequently broken drivers they cared
about were fixed by adding the stronger mmiowb().

What should have happened was wmb and writel remained correct, sane, and
expensive, and they add an mmio_wmb() to order MMIO stores made by the
writel_relaxed accessors, then use that to speed up the few drivers they
care about.

Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
make wmb ordering talk about stores to the device, not to some
intermediate stage of the interconnect where it can be subsequently
reordered wrt the device? Drivers can be converted back to using wmb
or writel gradually.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 10:13                                                                 ` Will Deacon
@ 2018-03-28 16:57                                                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-28 16:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > they should have the same semantics as writel.
> > > 
> > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > one device, and the memory mapped PCI I/O space window almost certainly
> > > counts as a separate device to the CPU.
> > 
> > That sounds bogus.
> 
> To elaborate, if you do the following on arm:
> 
> 	writel(DEVICE_FOO);
> 	writel(DEVICE_BAR);
> 
> we generally cannot guarantee in which order those accesses will hit the
> devices even if we add every barrier under the sun. You'd need something
> in between, specific to DEVICE_FOO (probably a read-back) to really push
> the first write out. This doesn't sound like it would be that uncommon to
> me.

The PCI posted write does not require the above to execute 'in order'
only that any bus segment shared by the two devices have the writes
issued in CPU order. ie at a shared PCI root port for instance.

If I recall this is very similar to the ordering that ARM's on-chip
AXI interconnect is supposed to provide.. So I'd be very surprised if
a modern ARM64 has an meaningful difference from x86 here.

When talking about ordering between the devices, the relevant question
is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
from the DEVICE_FOO. 'ordered' means that in this case
writel(DEVICE_FOO) must be presented to FOO before anything generated
by BAR.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-28 16:57                                                                   ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-28 16:57 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Sinan Kaya, David Laight,
	Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > they should have the same semantics as writel.
> > > 
> > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > one device, and the memory mapped PCI I/O space window almost certainly
> > > counts as a separate device to the CPU.
> > 
> > That sounds bogus.
> 
> To elaborate, if you do the following on arm:
> 
> 	writel(DEVICE_FOO);
> 	writel(DEVICE_BAR);
> 
> we generally cannot guarantee in which order those accesses will hit the
> devices even if we add every barrier under the sun. You'd need something
> in between, specific to DEVICE_FOO (probably a read-back) to really push
> the first write out. This doesn't sound like it would be that uncommon to
> me.

The PCI posted write does not require the above to execute 'in order'
only that any bus segment shared by the two devices have the writes
issued in CPU order. ie at a shared PCI root port for instance.

If I recall this is very similar to the ordering that ARM's on-chip
AXI interconnect is supposed to provide.. So I'd be very surprised if
a modern ARM64 has an meaningful difference from x86 here.

When talking about ordering between the devices, the relevant question
is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
from the DEVICE_FOO. 'ordered' means that in this case
writel(DEVICE_FOO) must be presented to FOO before anything generated
by BAR.

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 16:23                                                       ` Nicholas Piggin
@ 2018-03-28 21:31                                                         ` Benjamin Herrenschmidt
  2018-03-28 22:09                                                           ` Nicholas Piggin
  2018-03-29  9:20                                                           ` Will Deacon
  0 siblings, 2 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-28 21:31 UTC (permalink / raw)
  To: Nicholas Piggin, David Miller
  Cc: paulmck, arnd, linux-rdma, linuxppc-dev, linus971, will.deacon,
	alexander.duyck, okaya, jgg, David.Laight, oohall, netdev,
	alexander.h.duyck, torvalds

On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.  
> > 
> > Agreed.
> 
> While we're making decrees here, can we do something about mmiowb?
> The semantics are basically indecipherable.

I was going to tackle that next :-)

>   This is a variation on the mandatory write barrier that causes writes to weakly
>   ordered I/O regions to be partially ordered.  Its effects may go beyond the
>   CPU->Hardware interface and actually affect the hardware at some level.
> 
> How can a driver writer possibly get that right?
> 
> IIRC it was added for some big ia64 system that was really expensive
> to implement the proper wmb() semantics on. So wmb() semantics were
> quietly downgraded, then the subsequently broken drivers they cared
> about were fixed by adding the stronger mmiowb().
> 
> What should have happened was wmb and writel remained correct, sane, and
> expensive, and they add an mmio_wmb() to order MMIO stores made by the
> writel_relaxed accessors, then use that to speed up the few drivers they
> care about.
> 
> Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> make wmb ordering talk about stores to the device, not to some
> intermediate stage of the interconnect where it can be subsequently
> reordered wrt the device? Drivers can be converted back to using wmb
> or writel gradually.

I was under the impression that mmiowb was specifically about ordering
writel's with a subsequent spin_unlock, without it, MMIOs from
different CPUs (within the same lock) would still arrive OO.

If that's indeed the case, I would suggest ia64 switches to a similar
per-cpu flag trick powerpc uses.

Cheers,
Ben.

> Thanks,
> Nick

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 21:31                                                         ` Benjamin Herrenschmidt
@ 2018-03-28 22:09                                                           ` Nicholas Piggin
  2018-03-29  9:20                                                           ` Will Deacon
  1 sibling, 0 replies; 216+ messages in thread
From: Nicholas Piggin @ 2018-03-28 22:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, paulmck, arnd, linux-rdma, linuxppc-dev, linus971,
	will.deacon, alexander.duyck, okaya, jgg, David.Laight, oohall,
	netdev, alexander.h.duyck, torvalds

On Thu, 29 Mar 2018 08:31:32 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT)
> > David Miller <davem@davemloft.net> wrote:
> >   
> > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > >   
> > > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > > the archs are unused or dead anyway.    
> > > 
> > > Agreed.  
> > 
> > While we're making decrees here, can we do something about mmiowb?
> > The semantics are basically indecipherable.  
> 
> I was going to tackle that next :-)
> 
> >   This is a variation on the mandatory write barrier that causes writes to weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.  
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.

Yes more or less, and I think that until mmiowb was introduced, wmb
or writel was sufficient for this.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 16:57                                                                   ` Jason Gunthorpe
@ 2018-03-29  9:19                                                                     ` Will Deacon
  -1 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-29  9:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > they should have the same semantics as writel.
> > > > 
> > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > counts as a separate device to the CPU.
> > > 
> > > That sounds bogus.
> > 
> > To elaborate, if you do the following on arm:
> > 
> > 	writel(DEVICE_FOO);
> > 	writel(DEVICE_BAR);
> > 
> > we generally cannot guarantee in which order those accesses will hit the
> > devices even if we add every barrier under the sun. You'd need something
> > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > the first write out. This doesn't sound like it would be that uncommon to
> > me.
> 
> The PCI posted write does not require the above to execute 'in order'
> only that any bus segment shared by the two devices have the writes
> issued in CPU order. ie at a shared PCI root port for instance.
> 
> If I recall this is very similar to the ordering that ARM's on-chip
> AXI interconnect is supposed to provide.. So I'd be very surprised if
> a modern ARM64 has an meaningful difference from x86 here.

>From the architectural perspective, writes to different "peripherals" are
not ordered with respect to each other. The first writel will complete once
it gets its write acknowledgement, but this may not necessarily come from
the endpoint -- it could come from an intermediate buffer past the point of
serialisation (i.e. the write will then be ordered with respect to other
accesses to that same endpoint). The PCI root port would look like one
peripheral here.

> When talking about ordering between the devices, the relevant question
> is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> from the DEVICE_FOO. 'ordered' means that in this case
> writel(DEVICE_FOO) must be presented to FOO before anything generated
> by BAR.

Yes, and that isn't the case for arm because the writes can still be
buffered.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-29  9:19                                                                     ` Will Deacon
  0 siblings, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-29  9:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Sinan Kaya, David Laight,
	Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > they should have the same semantics as writel.
> > > > 
> > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > counts as a separate device to the CPU.
> > > 
> > > That sounds bogus.
> > 
> > To elaborate, if you do the following on arm:
> > 
> > 	writel(DEVICE_FOO);
> > 	writel(DEVICE_BAR);
> > 
> > we generally cannot guarantee in which order those accesses will hit the
> > devices even if we add every barrier under the sun. You'd need something
> > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > the first write out. This doesn't sound like it would be that uncommon to
> > me.
> 
> The PCI posted write does not require the above to execute 'in order'
> only that any bus segment shared by the two devices have the writes
> issued in CPU order. ie at a shared PCI root port for instance.
> 
> If I recall this is very similar to the ordering that ARM's on-chip
> AXI interconnect is supposed to provide.. So I'd be very surprised if
> a modern ARM64 has an meaningful difference from x86 here.

>From the architectural perspective, writes to different "peripherals" are
not ordered with respect to each other. The first writel will complete once
it gets its write acknowledgement, but this may not necessarily come from
the endpoint -- it could come from an intermediate buffer past the point of
serialisation (i.e. the write will then be ordered with respect to other
accesses to that same endpoint). The PCI root port would look like one
peripheral here.

> When talking about ordering between the devices, the relevant question
> is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> from the DEVICE_FOO. 'ordered' means that in this case
> writel(DEVICE_FOO) must be presented to FOO before anything generated
> by BAR.

Yes, and that isn't the case for arm because the writes can still be
buffered.

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 21:31                                                         ` Benjamin Herrenschmidt
  2018-03-28 22:09                                                           ` Nicholas Piggin
@ 2018-03-29  9:20                                                           ` Will Deacon
  1 sibling, 0 replies; 216+ messages in thread
From: Will Deacon @ 2018-03-29  9:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Nicholas Piggin, David Miller, paulmck, arnd, linux-rdma,
	linuxppc-dev, linus971, alexander.duyck, okaya, jgg,
	David.Laight, oohall, netdev, alexander.h.duyck, torvalds

On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote:
> >   This is a variation on the mandatory write barrier that causes writes to weakly
> >   ordered I/O regions to be partially ordered.  Its effects may go beyond the
> >   CPU->Hardware interface and actually affect the hardware at some level.
> > 
> > How can a driver writer possibly get that right?
> > 
> > IIRC it was added for some big ia64 system that was really expensive
> > to implement the proper wmb() semantics on. So wmb() semantics were
> > quietly downgraded, then the subsequently broken drivers they cared
> > about were fixed by adding the stronger mmiowb().
> > 
> > What should have happened was wmb and writel remained correct, sane, and
> > expensive, and they add an mmio_wmb() to order MMIO stores made by the
> > writel_relaxed accessors, then use that to speed up the few drivers they
> > care about.
> > 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just
> > make wmb ordering talk about stores to the device, not to some
> > intermediate stage of the interconnect where it can be subsequently
> > reordered wrt the device? Drivers can be converted back to using wmb
> > or writel gradually.
> 
> I was under the impression that mmiowb was specifically about ordering
> writel's with a subsequent spin_unlock, without it, MMIOs from
> different CPUs (within the same lock) would still arrive OO.
> 
> If that's indeed the case, I would suggest ia64 switches to a similar
> per-cpu flag trick powerpc uses.

... or we could remove ia64.

/me runs for cover

Will

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-28 15:55                                                     ` David Miller
  2018-03-28 16:23                                                       ` Nicholas Piggin
@ 2018-03-29 13:56                                                       ` Sinan Kaya
  2018-03-29 14:04                                                         ` David Miller
                                                                           ` (2 more replies)
  1 sibling, 3 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-29 13:56 UTC (permalink / raw)
  To: David Miller, benh
  Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight,
	oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck,
	netdev, linus971

On 3/28/2018 11:55 AM, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Thu, 29 Mar 2018 02:13:16 +1100
> 
>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>> the archs are unused or dead anyway.
> 
> Agreed.
> 

I pinged most of the maintainers yesterday.
Which arches do we care about these days?
I have not been paying attention any other architecture besides arm64.

arch		status			detail
------		-------------		------------------------------------
alpha		question sent
arc		question sent		ysato@users.sourceforge.jp will fix it.
arm		no issues
arm64		no issues
blackfin	question sent		about to be removed
c6x		question sent
cris		question sent
frv
h8300		question sent
hexagon		question sent
ia64		no issues		confirmed by Tony Luck
m32r
m68k		question sent
metag
microblaze	question sent
mips		question sent
mn10300		question sent
nios2		question sent
openrisc	no issues		shorne@gmail.com says should no issues
parisc		no issues		grantgrundler@gmail.com says most probably no problem but still looking
powerpc		no issues
riscv		question sent
s390		question sent
score		question sent
sh		question sent
sparc		question sent
tile		question sent
unicore32	question sent
x86		no issues
xtensa		question sent


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29 13:56                                                       ` Sinan Kaya
@ 2018-03-29 14:04                                                         ` David Miller
  2018-03-29 16:29                                                         ` Arnd Bergmann
  2018-03-30  1:40                                                         ` Benjamin Herrenschmidt
  2 siblings, 0 replies; 216+ messages in thread
From: David Miller @ 2018-03-29 14:04 UTC (permalink / raw)
  To: okaya
  Cc: benh, torvalds, alexander.duyck, will.deacon, arnd, jgg,
	David.Laight, oohall, linuxppc-dev, linux-rdma,
	alexander.h.duyck, paulmck, netdev, linus971

From: Sinan Kaya <okaya@codeaurora.org>
Date: Thu, 29 Mar 2018 09:56:01 -0400

> sparc		question sent

Sparc never lets physical memory accesses pass MMIO, and vice versa.

They are always strongly ordered amongst eachother.

Therefore no explicit barrier instructions are necessary.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29  9:19                                                                     ` Will Deacon
@ 2018-03-29 14:45                                                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-29 14:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Peter Zijlstra, David Laight, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	Ingo Molnar

On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > > they should have the same semantics as writel.
> > > > > 
> > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > > counts as a separate device to the CPU.
> > > > 
> > > > That sounds bogus.
> > > 
> > > To elaborate, if you do the following on arm:
> > > 
> > > 	writel(DEVICE_FOO);
> > > 	writel(DEVICE_BAR);
> > > 
> > > we generally cannot guarantee in which order those accesses will hit the
> > > devices even if we add every barrier under the sun. You'd need something
> > > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > > the first write out. This doesn't sound like it would be that uncommon to
> > > me.
> > 
> > The PCI posted write does not require the above to execute 'in order'
> > only that any bus segment shared by the two devices have the writes
> > issued in CPU order. ie at a shared PCI root port for instance.
> > 
> > If I recall this is very similar to the ordering that ARM's on-chip
> > AXI interconnect is supposed to provide.. So I'd be very surprised if
> > a modern ARM64 has an meaningful difference from x86 here.
> 
> From the architectural perspective, writes to different "peripherals" are
> not ordered with respect to each other. The first writel will complete once
> it gets its write acknowledgement, but this may not necessarily come from
> the endpoint -- it could come from an intermediate buffer past the point of
> serialisation (i.e. the write will then be ordered with respect to other
> accesses to that same endpoint). The PCI root port would look like one
> peripheral here.

That is basically the same as PCI - PCI has no write ACK, so all
writes are buffered by the PCI interconnect and complete in some
undefined temporal order when multiple end points are involved.

This does not seem very different from what happens in x86..

> > When talking about ordering between the devices, the relevant question
> > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > from the DEVICE_FOO. 'ordered' means that in this case
> > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > by BAR.
> 
> Yes, and that isn't the case for arm because the writes can still be
> buffered.

The statement is not about buffering, or temporal completion order, or
the order of acks returning to the CPU. It is about pure transaction
ordering inside the interconnect.

Can write BAR -> FOO pass write CPU -> FOO?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-29 14:45                                                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-29 14:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Sinan Kaya, David Laight,
	Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote:
> On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote:
> > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote:
> > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote:
> > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote:
> > > > > > powerpc and ARM can't quite make them synchronous I think, but at least
> > > > > > they should have the same semantics as writel.
> > > > > 
> > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within
> > > > > one device, and the memory mapped PCI I/O space window almost certainly
> > > > > counts as a separate device to the CPU.
> > > > 
> > > > That sounds bogus.
> > > 
> > > To elaborate, if you do the following on arm:
> > > 
> > > 	writel(DEVICE_FOO);
> > > 	writel(DEVICE_BAR);
> > > 
> > > we generally cannot guarantee in which order those accesses will hit the
> > > devices even if we add every barrier under the sun. You'd need something
> > > in between, specific to DEVICE_FOO (probably a read-back) to really push
> > > the first write out. This doesn't sound like it would be that uncommon to
> > > me.
> > 
> > The PCI posted write does not require the above to execute 'in order'
> > only that any bus segment shared by the two devices have the writes
> > issued in CPU order. ie at a shared PCI root port for instance.
> > 
> > If I recall this is very similar to the ordering that ARM's on-chip
> > AXI interconnect is supposed to provide.. So I'd be very surprised if
> > a modern ARM64 has an meaningful difference from x86 here.
> 
> From the architectural perspective, writes to different "peripherals" are
> not ordered with respect to each other. The first writel will complete once
> it gets its write acknowledgement, but this may not necessarily come from
> the endpoint -- it could come from an intermediate buffer past the point of
> serialisation (i.e. the write will then be ordered with respect to other
> accesses to that same endpoint). The PCI root port would look like one
> peripheral here.

That is basically the same as PCI - PCI has no write ACK, so all
writes are buffered by the PCI interconnect and complete in some
undefined temporal order when multiple end points are involved.

This does not seem very different from what happens in x86..

> > When talking about ordering between the devices, the relevant question
> > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > from the DEVICE_FOO. 'ordered' means that in this case
> > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > by BAR.
> 
> Yes, and that isn't the case for arm because the writes can still be
> buffered.

The statement is not about buffering, or temporal completion order, or
the order of acks returning to the CPU. It is about pure transaction
ordering inside the interconnect.

Can write BAR -> FOO pass write CPU -> FOO?

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
  2018-03-29 14:45                                                                       ` Jason Gunthorpe
@ 2018-03-29 14:58                                                                         ` David Laight
  -1 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-29 14:58 UTC (permalink / raw)
  To: 'Jason Gunthorpe', Will Deacon
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya,
	Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant question
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
> 
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
> 
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bridge.
I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC accesses
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in the code.
It could even have a 'dev' argument.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* RE: RFC on writel and writel_relaxed
@ 2018-03-29 14:58                                                                         ` David Laight
  0 siblings, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-29 14:58 UTC (permalink / raw)
  To: 'Jason Gunthorpe', Will Deacon
  Cc: Benjamin Herrenschmidt, Arnd Bergmann, Sinan Kaya, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

From: Jason Gunthorpe
> Sent: 29 March 2018 15:45
...
> > > When talking about ordering between the devices, the relevant questio=
n
> > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > from the DEVICE_FOO. 'ordered' means that in this case
> > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > by BAR.
> >
> > Yes, and that isn't the case for arm because the writes can still be
> > buffered.
>=20
> The statement is not about buffering, or temporal completion order, or
> the order of acks returning to the CPU. It is about pure transaction
> ordering inside the interconnect.
>=20
> Can write BAR -> FOO pass write CPU -> FOO?

Almost certainly.
The first cpu write can almost certainly be 'stalled' at the shared PCIe br=
idge.
The second cpu write then completes (to a different target).
That target then issues a peer to peer transfer that reaches the shared bri=
dge.
I doubt the order of the transactions is guaranteed when it becomes 'un-sta=
lled'.

Of course, these are peer to peer transfers, and strange ones at that.
Normally you'd not be doing peer to peer transfers that access 'memory'
the cpu has just written to.

Requiring extra barriers in this case, or different functions for WC access=
es
shouldn't really be an issue.

Even requiring a barrier between a write to dma coherent memory and a write
that starts dma isn't really onerous.
Even if it is a nop on all current architectures it is a good comment in th=
e code.
It could even have a 'dev' argument.

	David

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29 13:56                                                       ` Sinan Kaya
  2018-03-29 14:04                                                         ` David Miller
@ 2018-03-29 16:29                                                         ` Arnd Bergmann
  2018-03-29 16:59                                                           ` Sinan Kaya
  2018-03-30  1:40                                                         ` Benjamin Herrenschmidt
  2 siblings, 1 reply; 216+ messages in thread
From: Arnd Bergmann @ 2018-03-29 16:29 UTC (permalink / raw)
  To: Sinan Kaya
  Cc: David Miller, Benjamin Herrenschmidt, Linus Torvalds,
	Alexander Duyck, Will Deacon, Jason Gunthorpe, David Laight,
	Oliver O'Halloran, linuxppc-dev, linux-rdma, Alexander Duyck,
	Paul E. McKenney, Networking, Linus Torvalds

On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>
>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>> the archs are unused or dead anyway.
>>
>> Agreed.
>>
>
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.
>
> arch            status                  detail
> ------          -------------           ------------------------------------
> alpha           question sent

I'm guessing alpha has problems

extern inline u32 readl(const volatile void __iomem *addr)
{
        u32 ret = __raw_readl(addr);
        mb();
        return ret;
}
extern inline void writel(u32 b, volatile void __iomem *addr)
{
        __raw_writel(b, addr);
        mb();
}

There is a barrier in writel /after/ the acess but not before.

> arc             question sent           ysato@users.sourceforge.jp will fix it.
> arm             no issues
> arm64           no issues
> blackfin        question sent           about to be removed
> c6x             question sent

no PCI, so it might not matter that much -- all drivers
are platform specific in the end.

> cris            question sent
> frv

cris and frv are getting removed

> h8300           question sent

no PCI

> hexagon         question sent

no PCI

> ia64            no issues               confirmed by Tony Luck
> m32r

removed

> m68k            question sent

> metag

removed

> microblaze      question sent
> mips            question sent

I'm guessing that some mips platforms have problems, but others don't.

> mn10300         question sent

removed

> nios2           question sent

no PCI

> openrisc        no issues               shorne@gmail.com says should no issues
> parisc          no issues               grantgrundler@gmail.com says most probably no problem but still looking
> powerpc         no issues
> riscv           question sent

riscv should be fine

> s390            question sent

Pretty sure this is also fine

> score           question sent

removed

> sh              question sent
> sparc           question sent

> tile            question sent

removed

> unicore32       question sent

Note the maintainer's new email address in linux-next.

> x86             no issues

> xtensa          question sent

removed.

         Arnd

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29 14:58                                                                         ` David Laight
@ 2018-03-29 16:40                                                                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-29 16:40 UTC (permalink / raw)
  To: David Laight
  Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Will Deacon,
	Sinan Kaya, Peter Zijlstra, Ingo Molnar, Oliver,
	Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 29, 2018 at 02:58:34PM +0000, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 29 March 2018 15:45
> ...
> > > > When talking about ordering between the devices, the relevant question
> > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > > from the DEVICE_FOO. 'ordered' means that in this case
> > > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > > by BAR.
> > >
> > > Yes, and that isn't the case for arm because the writes can still be
> > > buffered.
> > 
> > The statement is not about buffering, or temporal completion order, or
> > the order of acks returning to the CPU. It is about pure transaction
> > ordering inside the interconnect.
> > 
> > Can write BAR -> FOO pass write CPU -> FOO?
> 
> Almost certainly.
> The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
> The second cpu write then completes (to a different target).
> That target then issues a peer to peer transfer that reaches the shared bridge.
> I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'.

The PCI spec has very strong wording on ordering that covers this
case. Stalled bridges have to follow the ordering rules, and posted
writes cannot pass other posted writes.

Since in PCI all three transactions:
 CPU -> FOO
 CPU -> BAR
 BAR -> FOO

Must traverse a shared bus segment, they must be placed on that bus in
the above order, and the bridge(s) toward FOO must preserve this
order.

ARM's AXI has similar rules, I just can't recall the tiny details
right now :)

> Of course, these are peer to peer transfers, and strange ones at that.
> Normally you'd not be doing peer to peer transfers that access 'memory'
> the cpu has just written to.

It is the best situation I can think of where order of completion to
different devices would matter to a generic Linux driver..

.. And there are patches circulating right now for NVMe that enable
exactly this kind of transfer, and rely on these kind of semantics, so
it is a relevant detail :)

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-29 16:40                                                                           ` Jason Gunthorpe
  0 siblings, 0 replies; 216+ messages in thread
From: Jason Gunthorpe @ 2018-03-29 16:40 UTC (permalink / raw)
  To: David Laight
  Cc: Will Deacon, Benjamin Herrenschmidt, Arnd Bergmann, Sinan Kaya,
	Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Paul E. McKenney, Peter Zijlstra, Ingo Molnar,
	Jonathan Corbet

On Thu, Mar 29, 2018 at 02:58:34PM +0000, David Laight wrote:
> From: Jason Gunthorpe
> > Sent: 29 March 2018 15:45
> ...
> > > > When talking about ordering between the devices, the relevant question
> > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA
> > > > from the DEVICE_FOO. 'ordered' means that in this case
> > > > writel(DEVICE_FOO) must be presented to FOO before anything generated
> > > > by BAR.
> > >
> > > Yes, and that isn't the case for arm because the writes can still be
> > > buffered.
> > 
> > The statement is not about buffering, or temporal completion order, or
> > the order of acks returning to the CPU. It is about pure transaction
> > ordering inside the interconnect.
> > 
> > Can write BAR -> FOO pass write CPU -> FOO?
> 
> Almost certainly.
> The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge.
> The second cpu write then completes (to a different target).
> That target then issues a peer to peer transfer that reaches the shared bridge.
> I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'.

The PCI spec has very strong wording on ordering that covers this
case. Stalled bridges have to follow the ordering rules, and posted
writes cannot pass other posted writes.

Since in PCI all three transactions:
 CPU -> FOO
 CPU -> BAR
 BAR -> FOO

Must traverse a shared bus segment, they must be placed on that bus in
the above order, and the bridge(s) toward FOO must preserve this
order.

ARM's AXI has similar rules, I just can't recall the tiny details
right now :)

> Of course, these are peer to peer transfers, and strange ones at that.
> Normally you'd not be doing peer to peer transfers that access 'memory'
> the cpu has just written to.

It is the best situation I can think of where order of completion to
different devices would matter to a generic Linux driver..

.. And there are patches circulating right now for NVMe that enable
exactly this kind of transfer, and rely on these kind of semantics, so
it is a relevant detail :)

Jason

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29 16:29                                                         ` Arnd Bergmann
@ 2018-03-29 16:59                                                           ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-29 16:59 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: David Miller, Benjamin Herrenschmidt, Linus Torvalds,
	Alexander Duyck, Will Deacon, Jason Gunthorpe, David Laight,
	Oliver O'Halloran, linuxppc-dev, linux-rdma, Alexander Duyck,
	Paul E. McKenney, Networking, Linus Torvalds

On 3/29/2018 12:29 PM, Arnd Bergmann wrote:
> On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
>>
>> arch            status                  detail
>> ------          -------------           ------------------------------------
>> alpha           question sent

Thanks for the detailed analysis.

> 
> I'm guessing alpha has problems
> 
> extern inline u32 readl(const volatile void __iomem *addr)
> {
>         u32 ret = __raw_readl(addr);
>         mb();
>         return ret;
> }
> extern inline void writel(u32 b, volatile void __iomem *addr)
> {
>         __raw_writel(b, addr);
>         mb();
> }

Looks like a problem to me too. I'll start a thread with the alpha
people and CC you.


> 
> There is a barrier in writel /after/ the acess but not before.
> 

This is the consolidated list. I also heart back from m68k and corrected
contacts for arc and h8300.

arch            status                  detail
------          -------------           ------------------------------------
alpha		question sent		Arnd: alpha has problems
arc		question sent		Vineet.Gupta1@synopsys.com says he'll get to this
					in the next few days
arm		no issues
arm64		no issues
c6x		no issues		no PCI
h8300		no issues		no PCI: ysato@users.sourceforge.jp will fix it.
hexagon		no issues		no PCI
ia64		no issues		confirmed by Tony Luck
m68k		no issues		geert@linux-m68k.org says no problem
metag		no issues		arnd: removed
microblaze	question sent		arnd: some mips platforms have problems
mips		question sent		arnd: some mips platforms have problems
nds32		question sent
nios2		no issues		no PCI
openrisc	no issues		shorne@gmail.com says should no issues
parisc		no issues		grantgrundler@gmail.com says most probably no problem
					but still looking
powerpc		no issues
riscv		no issues		arnd: riscv should be fine
s390		no issues		arnd: Pretty sure this is also fine
sh		question sent
sparc		no issues		davem@davemloft.net says always strongly ordered
unicore32	question sent		resent to gxt@pku.edu.cn
x86		no issues
x86_64		no issues



> 
>          Arnd
> 


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-29 13:56                                                       ` Sinan Kaya
  2018-03-29 14:04                                                         ` David Miller
  2018-03-29 16:29                                                         ` Arnd Bergmann
@ 2018-03-30  1:40                                                         ` Benjamin Herrenschmidt
  2018-04-02 13:01                                                           ` Sinan Kaya
  2 siblings, 1 reply; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-30  1:40 UTC (permalink / raw)
  To: Sinan Kaya, David Miller
  Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight,
	oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck,
	netdev, linus971

On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
> On 3/28/2018 11:55 AM, David Miller wrote:
> > From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> > Date: Thu, 29 Mar 2018 02:13:16 +1100
> > 
> > > Let's fix all archs, it's way easier than fixing all drivers. Half of
> > > the archs are unused or dead anyway.
> > 
> > Agreed.
> > 
> 
> I pinged most of the maintainers yesterday.
> Which arches do we care about these days?
> I have not been paying attention any other architecture besides arm64.

Thanks for going through that exercise !

Once sparc, s390, microblaze and mips reply, I think we'll have a good
coverage, maybe riscv is to put in that lot too.
 
Cheers,
Ben.

> 
> arch		status			detail
> ------		-------------		------------------------------------
> alpha		question sent
> arc		question sent		ysato@users.sourceforge.jp will fix it.
> arm		no issues
> arm64		no issues
> blackfin	question sent		about to be removed
> c6x		question sent
> cris		question sent
> frv
> h8300		question sent
> hexagon		question sent
> ia64		no issues		confirmed by Tony Luck
> m32r
> m68k		question sent
> metag
> microblaze	question sent
> mips		question sent
> mn10300		question sent
> nios2		question sent
> openrisc	no issues		shorne@gmail.com says should no issues
> parisc		no issues		grantgrundler@gmail.com says most probably no problem but still looking
> powerpc		no issues
> riscv		question sent
> s390		question sent
> score		question sent
> sh		question sent
> sparc		question sent
> tile		question sent
> unicore32	question sent
> x86		no issues
> xtensa		question sent
> 
> 

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-30  1:40                                                         ` Benjamin Herrenschmidt
@ 2018-04-02 13:01                                                           ` Sinan Kaya
  0 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-04-02 13:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Miller
  Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight,
	oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck,
	netdev, linus971

On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote:
>> On 3/28/2018 11:55 AM, David Miller wrote:
>>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> Date: Thu, 29 Mar 2018 02:13:16 +1100
>>>
>>>> Let's fix all archs, it's way easier than fixing all drivers. Half of
>>>> the archs are unused or dead anyway.
>>>
>>> Agreed.
>>>
>>
>> I pinged most of the maintainers yesterday.
>> Which arches do we care about these days?
>> I have not been paying attention any other architecture besides arm64.
> 
> Thanks for going through that exercise !
> 
> Once sparc, s390, microblaze and mips reply, I think we'll have a good
> coverage, maybe riscv is to put in that lot too.


I posted the following two patches for supporting microblaze and unicore32. 

[PATCH v2 1/2] io: prevent compiler reordering on the default writeX() implementation
[PATCH v2 2/2] io: prevent compiler reordering on the default readX() implementation

The rest of the arches except mips and alpha seem OK.
I sent a question email on Friday to mips and alpha mailing lists. I'll follow up with
an actual patch today.


-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 21:54 Alexander Duyck
  2018-03-27 22:35 ` Sinan Kaya
@ 2018-03-27 23:43 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-27 23:43 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Will Deacon, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 14:54 -0700, Alexander Duyck wrote:
> On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
> > >  combined buffers.
> > > 
> > > Alex:
> > > "Don't bother. I can tell you right now that for x86 you have to have a
> > > wmb() before the writel().
> > 
> > No, this isn't the semantics of writel. You shouldn't need it unless
> > something changed and we need to revisit our complete understanding of
> > *all* MMIO accessor semantics.
> 
> The issue seems to be that there have been two different ways of
> dealing with this. There has historically been a number of different
> drivers that have been carrying this wmb() workaround since something
> like 2002. I get that the semantics for writel might have changed
> since then, but those of us who already have the wmb() in our drivers
> will be very wary of anyone wanting to go through and remove them
> since writel is supposed to be "good enough". I would much rather err
> on the side of caution here.
> 
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.
> 
> > At least for UC space, it has always been accepted (and enforced) that
> > writel would not require any other barrier to order vs. previous stores
> > to memory.
> 
> So the one thing I would question here is if this is UC vs UC or if
> this extends to other types as well? So for x86 we could find
> references to Write Combining being flushed by a write to UC memory,
> however I have yet to find a clear explanation of what a write to UC
> does to WB. 

Well, this is the standard write memory + trigger DMA case, the one
specific case for which Linus was adamant we don't need another barrier
 back then ...

> My personal inclination would be to err on the side of
> caution.

Which means writel_relaxed is now pointless ?

We need clear semantics here. In this case the "side of caution" means
we are randomly doing things not understanding what really happens and
that makes me *more* nervous.

>  I just don't want us going through and removing the wmb()
> calls because it "should" work. I would want to know for certain it
> will work.

We need to know for certain anyway. Otherwise, all the drivers that do
not have wmb's are potentially broken.

So I dont agree with the status quo.

We need to establish precisely what x86 does, decide what we want the
semantic of writel to be, and implement things accordingly.

Ben.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
  2018-03-27 21:54 Alexander Duyck
@ 2018-03-27 22:35 ` Sinan Kaya
  2018-03-27 23:43 ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-27 22:35 UTC (permalink / raw)
  To: Alexander Duyck, Benjamin Herrenschmidt
  Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Will Deacon, Paul E. McKenney, netdev

On 3/27/2018 5:54 PM, Alexander Duyck wrote:
> I view the wmb() + writel_relaxed() as more of a driver owning and
> handling this itself. Besides in the Intel Ethernet driver case it is
> better performance as our wmb() placement for us also provides a
> secondary barrier so we don't need to add a separate smp_wmb() to deal
> with a potential race we have with the Tx cleanup.

Thanks for the reminder. 

I forgot about the double barrier optimization. wmb() + writel_relaxed()
seems to be the best option for Intel network drivers at this moment. 
Otherwise, we'll have to remove wmb() and throw in smp barriers there
like you mentioned.

I'll leave the changes in the Intel drivers alone.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: RFC on writel and writel_relaxed
@ 2018-03-27 21:54 Alexander Duyck
  2018-03-27 22:35 ` Sinan Kaya
  2018-03-27 23:43 ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 216+ messages in thread
From: Alexander Duyck @ 2018-03-27 21:54 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT),
	linux-rdma, Will Deacon, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote:
>>  combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> No, this isn't the semantics of writel. You shouldn't need it unless
> something changed and we need to revisit our complete understanding of
> *all* MMIO accessor semantics.

The issue seems to be that there have been two different ways of
dealing with this. There has historically been a number of different
drivers that have been carrying this wmb() workaround since something
like 2002. I get that the semantics for writel might have changed
since then, but those of us who already have the wmb() in our drivers
will be very wary of anyone wanting to go through and remove them
since writel is supposed to be "good enough". I would much rather err
on the side of caution here.

I view the wmb() + writel_relaxed() as more of a driver owning and
handling this itself. Besides in the Intel Ethernet driver case it is
better performance as our wmb() placement for us also provides a
secondary barrier so we don't need to add a separate smp_wmb() to deal
with a potential race we have with the Tx cleanup.

> At least for UC space, it has always been accepted (and enforced) that
> writel would not require any other barrier to order vs. previous stores
> to memory.

So the one thing I would question here is if this is UC vs UC or if
this extends to other types as well? So for x86 we could find
references to Write Combining being flushed by a write to UC memory,
however I have yet to find a clear explanation of what a write to UC
does to WB. My personal inclination would be to err on the side of
caution. I just don't want us going through and removing the wmb()
calls because it "should" work. I would want to know for certain it
will work.

- Alex

^ permalink raw reply	[flat|nested] 216+ messages in thread

end of thread, other threads:[~2018-04-02 13:01 UTC | newest]

Thread overview: 216+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-03-21  3:07 RFC on writel and writel_relaxed Sinan Kaya
2018-03-21  3:40 ` Oliver
2018-03-21  3:40   ` Oliver
2018-03-21 13:53   ` Sinan Kaya
2018-03-21 13:53     ` Sinan Kaya
2018-03-21 13:58     ` Sinan Kaya
2018-03-21 13:58       ` Sinan Kaya
2018-03-26 13:43       ` Arnd Bergmann
2018-03-26 13:43         ` Arnd Bergmann
2018-03-26 16:00         ` Sinan Kaya
2018-03-26 16:00           ` Sinan Kaya
2018-03-21 14:35     ` David Laight
2018-03-21 14:35       ` David Laight
2018-03-21 15:04       ` Sinan Kaya
2018-03-22  5:24       ` Oliver
2018-03-22  8:20         ` Gabriel Paubert
2018-03-22  8:20           ` Gabriel Paubert
2018-03-22  9:25           ` Oliver
2018-03-22  9:25             ` Oliver
2018-03-22 11:25             ` Gabriel Paubert
2018-03-22 11:25               ` Gabriel Paubert
2018-03-22 10:37         ` David Laight
2018-03-22 10:37           ` David Laight
2018-03-22  4:24     ` Benjamin Herrenschmidt
2018-03-22  4:24       ` Benjamin Herrenschmidt
2018-03-22 10:15       ` Oliver
2018-03-22 10:15         ` Oliver
2018-03-22 13:52         ` Benjamin Herrenschmidt
2018-03-22 13:52           ` Benjamin Herrenschmidt
2018-03-22 17:51           ` Sinan Kaya
2018-03-22 17:51             ` Sinan Kaya
2018-03-23  0:16             ` Benjamin Herrenschmidt
2018-03-23  0:16               ` Benjamin Herrenschmidt
2018-03-23 13:42               ` Sinan Kaya
2018-03-23 13:42                 ` Sinan Kaya
2018-03-24  1:22                 ` Benjamin Herrenschmidt
2018-03-24  1:22                   ` Benjamin Herrenschmidt
2018-03-24 15:06                   ` Sinan Kaya
2018-03-24 15:06                     ` Sinan Kaya
2018-03-26 11:44               ` Will Deacon
2018-03-26 11:44                 ` Will Deacon
2018-03-26 12:11                 ` okaya
2018-03-26 12:11                   ` okaya
2018-03-26 12:42                   ` Sinan Kaya
2018-03-26 12:42                     ` Sinan Kaya
2018-03-23 16:35           ` Jason Gunthorpe
2018-03-23 16:35             ` Jason Gunthorpe
2018-03-24  1:23             ` Benjamin Herrenschmidt
2018-03-24  1:23               ` Benjamin Herrenschmidt
2018-03-26 11:08               ` David Laight
2018-03-26 11:08                 ` David Laight
2018-03-26 16:54                 ` Jason Gunthorpe
2018-03-26 16:54                   ` Jason Gunthorpe
2018-03-26 19:44                   ` Arnd Bergmann
2018-03-26 19:44                     ` Arnd Bergmann
2018-03-26 20:25                     ` Jason Gunthorpe
2018-03-26 20:25                       ` Jason Gunthorpe
2018-03-26 20:43                       ` Arnd Bergmann
2018-03-26 20:43                         ` Arnd Bergmann
2018-03-26 21:09                         ` Jason Gunthorpe
2018-03-26 21:09                           ` Jason Gunthorpe
2018-03-26 21:30                           ` Arnd Bergmann
2018-03-26 21:30                             ` Arnd Bergmann
2018-03-26 21:46                             ` Sinan Kaya
2018-03-26 21:46                               ` Sinan Kaya
2018-03-26 22:01                               ` Benjamin Herrenschmidt
2018-03-26 22:01                                 ` Benjamin Herrenschmidt
2018-03-26 22:08                                 ` Sinan Kaya
2018-03-26 22:08                                   ` Sinan Kaya
2018-03-26 22:28                                   ` Benjamin Herrenschmidt
2018-03-26 22:28                                     ` Benjamin Herrenschmidt
2018-03-26 22:27                                 ` Jason Gunthorpe
2018-03-26 22:27                                   ` Jason Gunthorpe
2018-03-26 22:36                                   ` Benjamin Herrenschmidt
2018-03-26 22:36                                     ` Benjamin Herrenschmidt
2018-03-26 22:42                                     ` Benjamin Herrenschmidt
2018-03-26 22:42                                       ` Benjamin Herrenschmidt
2018-03-26 22:50                                     ` Jason Gunthorpe
2018-03-26 22:50                                       ` Jason Gunthorpe
2018-03-26 23:59                                       ` Benjamin Herrenschmidt
2018-03-26 23:59                                         ` Benjamin Herrenschmidt
2018-03-27  1:39                                         ` Jason Gunthorpe
2018-03-27  1:39                                           ` Jason Gunthorpe
2018-03-27  7:56                                   ` Arnd Bergmann
2018-03-27  7:56                                     ` Arnd Bergmann
2018-03-27  8:56                                     ` Benjamin Herrenschmidt
2018-03-27  8:56                                       ` Benjamin Herrenschmidt
2018-03-27  9:44                                       ` Arnd Bergmann
2018-03-27  9:44                                         ` Arnd Bergmann
2018-03-27 10:00                                         ` Will Deacon
2018-03-27 10:00                                           ` Will Deacon
2018-03-27 11:23                                         ` Benjamin Herrenschmidt
2018-03-27 11:23                                           ` Benjamin Herrenschmidt
2018-03-27 12:22                                           ` okaya
2018-03-27 12:22                                             ` okaya
2018-03-27 14:12                                             ` Jason Gunthorpe
2018-03-27 14:12                                               ` Jason Gunthorpe
2018-03-27 21:27                                               ` Benjamin Herrenschmidt
2018-03-27 21:27                                                 ` Benjamin Herrenschmidt
2018-03-27  9:57                                       ` Will Deacon
2018-03-27  9:57                                         ` Will Deacon
2018-03-27 10:05                                         ` Arnd Bergmann
2018-03-27 10:05                                           ` Arnd Bergmann
2018-03-27 10:09                                           ` Will Deacon
2018-03-27 10:09                                             ` Will Deacon
2018-03-27 10:53                                             ` Arnd Bergmann
2018-03-27 10:53                                               ` Arnd Bergmann
2018-03-27 11:02                                               ` Will Deacon
2018-03-27 11:02                                                 ` Will Deacon
2018-03-27 11:05                                                 ` Arnd Bergmann
2018-03-27 11:05                                                   ` Arnd Bergmann
2018-03-27 11:25                                                 ` Benjamin Herrenschmidt
2018-03-27 11:25                                                   ` Benjamin Herrenschmidt
2018-03-27 13:20                                                 ` David Laight
2018-03-27 13:20                                                   ` David Laight
2018-03-27 13:46                                                 ` Sinan Kaya
2018-03-27 13:46                                                   ` Sinan Kaya
2018-03-27 14:36                                                   ` Will Deacon
2018-03-27 14:36                                                     ` Will Deacon
2018-03-27 21:29                                                     ` Benjamin Herrenschmidt
2018-03-27 21:29                                                       ` Benjamin Herrenschmidt
2018-03-28  8:53                                                       ` Will Deacon
2018-03-28  8:53                                                         ` Will Deacon
2018-03-28  9:00                                                         ` David Laight
2018-03-28  9:00                                                           ` David Laight
2018-03-28  9:09                                                           ` Will Deacon
2018-03-28  9:09                                                             ` Will Deacon
2018-03-28  9:56                                                             ` Benjamin Herrenschmidt
2018-03-28  9:56                                                               ` Benjamin Herrenschmidt
2018-03-28  9:50                                                         ` Benjamin Herrenschmidt
2018-03-28  9:50                                                           ` Benjamin Herrenschmidt
2018-03-28  9:55                                                           ` Arnd Bergmann
2018-03-28  9:55                                                             ` Arnd Bergmann
2018-03-28 10:01                                                             ` Benjamin Herrenschmidt
2018-03-28 10:01                                                               ` Benjamin Herrenschmidt
2018-03-28 10:13                                                               ` Will Deacon
2018-03-28 10:13                                                                 ` Will Deacon
2018-03-28 16:57                                                                 ` Jason Gunthorpe
2018-03-28 16:57                                                                   ` Jason Gunthorpe
2018-03-29  9:19                                                                   ` Will Deacon
2018-03-29  9:19                                                                     ` Will Deacon
2018-03-29 14:45                                                                     ` Jason Gunthorpe
2018-03-29 14:45                                                                       ` Jason Gunthorpe
2018-03-29 14:58                                                                       ` David Laight
2018-03-29 14:58                                                                         ` David Laight
2018-03-29 16:40                                                                         ` Jason Gunthorpe
2018-03-29 16:40                                                                           ` Jason Gunthorpe
2018-03-27 21:24                                                   ` Benjamin Herrenschmidt
2018-03-27 21:24                                                     ` Benjamin Herrenschmidt
2018-03-27 11:21                                         ` Benjamin Herrenschmidt
2018-03-27 11:21                                           ` Benjamin Herrenschmidt
2018-03-27  9:42                                     ` Will Deacon
2018-03-27  9:42                                       ` Will Deacon
2018-03-27 11:20                                       ` Benjamin Herrenschmidt
2018-03-27 11:20                                         ` Benjamin Herrenschmidt
2018-03-27 11:24                                         ` Will Deacon
2018-03-27 11:24                                           ` Will Deacon
2018-03-27 14:24                                       ` Jason Gunthorpe
2018-03-27 14:24                                         ` Jason Gunthorpe
2018-03-27 14:16                                     ` Jason Gunthorpe
2018-03-27 14:16                                       ` Jason Gunthorpe
2018-03-26 22:00                             ` Benjamin Herrenschmidt
2018-03-26 22:00                               ` Benjamin Herrenschmidt
2018-03-27 14:46                               ` Sinan Kaya
2018-03-27 15:01                                 ` Jose Abreu
2018-03-27 15:10                                 ` Will Deacon
2018-03-27 18:54                                   ` Alexander Duyck
2018-03-27 19:54                                     ` Arnd Bergmann
2018-03-27 19:54                                       ` Arnd Bergmann
2018-03-27 20:46                                       ` Arnd Bergmann
2018-03-27 20:46                                         ` Arnd Bergmann
2018-03-27 21:33                                     ` Benjamin Herrenschmidt
2018-03-28  0:39                                       ` Linus Torvalds
2018-03-28  1:03                                         ` Benjamin Herrenschmidt
2018-03-28  2:51                                           ` Linus Torvalds
2018-03-28  3:24                                             ` Sinan Kaya
2018-03-28  4:41                                               ` Benjamin Herrenschmidt
2018-03-28  6:14                                               ` Linus Torvalds
2018-03-28 11:41                                                 ` okaya
2018-03-28 15:13                                                   ` Benjamin Herrenschmidt
2018-03-28 15:55                                                     ` David Miller
2018-03-28 16:23                                                       ` Nicholas Piggin
2018-03-28 21:31                                                         ` Benjamin Herrenschmidt
2018-03-28 22:09                                                           ` Nicholas Piggin
2018-03-29  9:20                                                           ` Will Deacon
2018-03-29 13:56                                                       ` Sinan Kaya
2018-03-29 14:04                                                         ` David Miller
2018-03-29 16:29                                                         ` Arnd Bergmann
2018-03-29 16:59                                                           ` Sinan Kaya
2018-03-30  1:40                                                         ` Benjamin Herrenschmidt
2018-04-02 13:01                                                           ` Sinan Kaya
2018-03-28  4:33                                             ` Benjamin Herrenschmidt
2018-03-28  6:26                                               ` Linus Torvalds
2018-03-28  6:42                                                 ` Benjamin Herrenschmidt
2018-03-28  6:53                                                   ` Linus Torvalds
2018-03-28  6:53                                                     ` Linus Torvalds
2018-03-28  6:53                                                     ` Linus Torvalds
2018-03-28  6:56                                                     ` Benjamin Herrenschmidt
2018-03-28  7:11                                                       ` Arnd Bergmann
2018-03-28  7:42                                                         ` Benjamin Herrenschmidt
2018-03-28  9:07                                                   ` Will Deacon
2018-03-28  9:56                                                     ` Benjamin Herrenschmidt
2018-03-28 10:13                                                       ` Aw: " Lino Sanfilippo
2018-03-28 10:20                                                         ` Benjamin Herrenschmidt
2018-03-28 11:30                                                       ` David Laight
2018-03-28 11:30                                                         ` David Laight
2018-03-28 15:12                                                         ` Benjamin Herrenschmidt
2018-03-28 16:16                                                           ` David Laight
2018-03-28 16:16                                                             ` David Laight
2018-03-28  1:21                                   ` Benjamin Herrenschmidt
2018-03-27 21:35                                 ` Benjamin Herrenschmidt
2018-03-26 21:26                   ` Benjamin Herrenschmidt
2018-03-26 21:26                     ` Benjamin Herrenschmidt
2018-03-27 21:54 Alexander Duyck
2018-03-27 22:35 ` Sinan Kaya
2018-03-27 23:43 ` Benjamin Herrenschmidt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.