* RFC on writel and writel_relaxed
From: Sinan Kaya @ 2018-03-21 3:07 UTC
To: open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

Hi PPC Maintainers,

We are seeking feedback on the status of the relaxed write API implementation. What is the motivation for not implementing the relaxed API?

I see that network drivers are working around the issue by calling the __raw_writel() API directly, but this also breaks other architectures like SPARC, since the semantics of __raw_writel() seem to be system dependent.

This is putting drivers into a tight position: they cannot achieve true multi-arch enablement and are forced into calling the __raw API flavors directly, with #ifdef BIG_ENDIAN ugliness.

Sinan

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.
* Re: RFC on writel and writel_relaxed
From: Oliver @ 2018-03-21 3:40 UTC
To: Sinan Kaya
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, Mar 21, 2018 at 2:07 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> Hi PPC Maintainers,
>
> We are seeking feedback on the status of relaxed write API implementation.
> What is the motivation for not implementing the relaxed API?

Hmm, good question. Looks like we've implemented the *_relaxed variants by aliasing them to the normal versions since the dawn of time. There's a comment in io.h saying something about us not having the expected semantics for the relaxed variants, but I don't see what the issue is... Ben?

> I see that network drivers are working around the issue by calling
> __raw_writel() API directly but this also breaks other architectures
> like SPARC since the semantics of __raw_writel() seems to be system dependent.

Yeah, that's pretty gross. Which drivers are doing this?

> This is putting drivers into a tight position and they cannot achieve true
> multi-arch enablement and are forced into calling __raw APIs flavors
> directly with #ifdef BIG_ENDIAN ugliness.
>
> Sinan
* Re: RFC on writel and writel_relaxed
From: Sinan Kaya @ 2018-03-21 13:53 UTC
To: Oliver
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/20/2018 10:40 PM, Oliver wrote:
>> I see that network drivers are working around the issue by calling
>> __raw_writel() API directly but this also breaks other architectures
>> like SPARC since the semantics of __raw_writel() seems to be system dependent.
> Yeah that's pretty gross. Which drivers are doing this?

Searching for __raw_writel() and BIG_ENDIAN in the drivers/net directory should give you the idea.

In a nutshell, drivers are doing this today:

    wmb()
    __raw_writel()
    mmiowb()

I'm in the process of posting patches ([1], [2], [3]) to various subsystems to eliminate double barriers by replacing sequences like

    wmb()
    writel()
    mmiowb()

with

    wmb()
    writel_relaxed()
    mmiowb()

Reviewers pointed out that writel_relaxed() is not __raw_writel() on PPC and cannot take advantage of the optimization.

Replacing writel_relaxed() with __raw_writel(), or vice versa, is not correct. writel_relaxed() needs to have ordering guarantees with respect to the order in which the device observes writes.

x86 has a compiler barrier inside the relaxed() API so that code does not get reordered. ARM64 architecturally guarantees device writes to be observed in order.

I was hoping that PPC could follow x86 and inject a compiler barrier into the relaxed functions.

BTW, I have no idea what a compiler barrier does on PPC and whether

    writel() == compiler barrier() + writel_relaxed()

can be said. Need guidance from the PPC maintainers.

[1] https://www.spinics.net/lists/netdev/msg490480.html
[2] https://www.spinics.net/lists/arm-kernel/msg642341.html
[3] https://www.spinics.net/lists/arm-kernel/msg642336.html

--
Sinan Kaya
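The double-barrier argument in the message above can be sketched in userspace. This is only an illustration under loud assumptions: every model_* name, the barrier counter and the fake doorbell register are hypothetical stand-ins, and the model assumes an architecture where writel() carries its own heavy barrier in front of the store while writel_relaxed() only stops compiler reordering. It is not the kernel's implementation for any architecture, and it relies on GCC/Clang inline asm syntax.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical counter: how many heavy barriers one doorbell write executes. */
static unsigned int model_heavy_barriers;

/* Compiler-only barrier (GCC/Clang syntax); emits no instruction. */
static inline void model_compiler_barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

/* Stand-ins for wmb() and mmiowb(); both are counted as heavy barriers. */
static inline void model_wmb(void)
{
	model_compiler_barrier();
	model_heavy_barriers++;
}

static inline void model_mmiowb(void)
{
	model_compiler_barrier();
	model_heavy_barriers++;
}

/* Assumption: writel() issues its own heavy barrier before the store. */
static inline void model_writel(uint32_t v, volatile uint32_t *reg)
{
	model_heavy_barriers++;
	model_compiler_barrier();
	*reg = v;
}

/* Assumption: writel_relaxed() only prevents compiler reordering. */
static inline void model_writel_relaxed(uint32_t v, volatile uint32_t *reg)
{
	model_compiler_barrier();
	*reg = v;
}

/* wmb(); writel(); mmiowb(): returns the number of heavy barriers used. */
static unsigned int ring_doorbell_strict(volatile uint32_t *reg, uint32_t v)
{
	model_heavy_barriers = 0;
	model_wmb();
	model_writel(v, reg);
	model_mmiowb();
	return model_heavy_barriers;
}

/* wmb(); writel_relaxed(); mmiowb(): the redundant barrier is gone. */
static unsigned int ring_doorbell_relaxed(volatile uint32_t *reg, uint32_t v)
{
	model_heavy_barriers = 0;
	model_wmb();
	model_writel_relaxed(v, reg);
	model_mmiowb();
	return model_heavy_barriers;
}

/* Usage check: same register contents, one heavy barrier fewer. */
static void check_doorbell_model(void)
{
	uint32_t doorbell = 0;

	assert(ring_doorbell_strict(&doorbell, 5) == 3);
	assert(ring_doorbell_relaxed(&doorbell, 7) == 2);
	assert(doorbell == 7);
}
```

In this model the register ends up with the same value either way; the only difference is the extra barrier that the strict sequence pays for inside writel(), which is the cost the patches try to remove.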
* Re: RFC on writel and writel_relaxed
From: Sinan Kaya @ 2018-03-21 13:58 UTC
To: Oliver
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/21/2018 8:53 AM, Sinan Kaya wrote:
> BTW, I have no idea what compiler barrier does on PPC and if
>
> writel() == compiler barrier() + writel_relaxed()
>
> can be said.

This should have been

    writel_relaxed() == compiler barrier() + __raw_writel()

--
Sinan Kaya
* Re: RFC on writel and writel_relaxed
From: Arnd Bergmann @ 2018-03-26 13:43 UTC
To: Sinan Kaya
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver

On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> writel() == compiler barrier() + writel_relaxed()
>>
>> can be said.
>
> this should have been
>
> writel_relaxed() == compiler barrier() + __raw_writel()

I don't think anyone has clarified this so far, but there are additional differences between the two: writel_relaxed() assumes we are talking to a 32-bit little-endian MMIO register, while __raw_writel() is primarily used for writing into memory-type regions with no particular byte order. This means:

- writel_relaxed() must perform a byte swap when running on big-endian kernels.
- When used with __packed MMIO pointers, __raw_writel() may turn into a series of byte writes, while writel_relaxed() must result in a single 32-bit access.
- A set of consecutive writel_relaxed() calls on the same device is issued in program order, while __raw_writel() is not ordered. This typically requires only a compiler barrier, but may also need a CPU barrier (in addition to the barriers we use to serialize with spinlocks and DMA in writel() but not writel_relaxed()).

Arnd
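The byte-order difference in the first bullet above can be shown with a small userspace model. Everything here is a hypothetical stand-in (the model_* helpers, the fake register, the runtime endianness probe); it is a sketch of the described semantics, not the kernel's accessors, and it relies on GCC/Clang inline asm syntax for the compiler barrier.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace stand-ins; none of this is kernel code. */

/* Runtime probe: is the lowest-addressed byte the least significant one? */
static inline int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Swap bytes only on big-endian hosts, so memory always holds LE layout. */
static inline uint32_t model_cpu_to_le32(uint32_t v)
{
	if (host_is_little_endian())
		return v;
	return ((v & 0x000000ffu) << 24) | ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8)  | ((v & 0xff000000u) >> 24);
}

/* Native byte order: whatever the CPU uses ends up in the register. */
static inline void model_raw_writel(uint32_t v, volatile uint32_t *addr)
{
	*addr = v;
}

/*
 * Little-endian register layout, preceded by a compiler barrier so the
 * compiler does not move the store ahead of earlier memory accesses.
 */
static inline void model_writel_relaxed(uint32_t v, volatile uint32_t *addr)
{
	__asm__ __volatile__("" ::: "memory");
	*addr = model_cpu_to_le32(v);
}

/* Usage check: with the relaxed model, byte 0 is always the low byte. */
static void check_register_layout(void)
{
	uint32_t reg = 0;
	uint8_t bytes[4];

	model_writel_relaxed(0x11223344u, &reg);
	memcpy(bytes, &reg, sizeof(bytes));
	assert(bytes[0] == 0x44 && bytes[3] == 0x11);
}
```

On a little-endian host the two model functions store identical bytes, which is why the difference only shows up as breakage on big-endian kernels and leads to the #ifdef BIG_ENDIAN workarounds mentioned at the start of the thread.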
* Re: RFC on writel and writel_relaxed
From: Sinan Kaya @ 2018-03-26 16:00 UTC
To: Arnd Bergmann
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver

On 3/26/2018 9:43 AM, Arnd Bergmann wrote:
> On Wed, Mar 21, 2018 at 2:58 PM, Sinan Kaya <okaya@codeaurora.org> wrote:
>> On 3/21/2018 8:53 AM, Sinan Kaya wrote:
>>> BTW, I have no idea what compiler barrier does on PPC and if
>>>
>>> writel() == compiler barrier() + writel_relaxed()
>>>
>>> can be said.
>>
>> this should have been
>>
>> writel_relaxed() == compiler barrier() + __raw_writel()
>
> I don't think anyone clarified this so far, but there are additional differences
> between the two, writel_relaxed() assumes we are talking to a 32-bit little-endian
> MMIO register, while __raw_writel() is primarily used for writing into memory-type
> regions with no particular byte order. This means:
>
> - writel_relaxed() must perform a byte swap when running on big-endian kernels
> - when used with __packed MMIO pointers, __raw_writel() may turn into a series
>   of byte writes, while writel_relaxed() must result in a single 32-bit access.
> - A set of consecutive writel_relaxed() on the same device is issued in program
>   order, while __raw_writel() is not ordered. This typically requires only a compiler
>   barrier, but may also need a CPU barrier (in addition to the barriers we use to
>   serialize with spinlocks and DMA in writel() but not writel_relaxed()).

Thanks for the great summary. I didn't know that __raw_writel() could get converted to byte writes.

>
> Arnd
>

--
Sinan Kaya
* RE: RFC on writel and writel_relaxed
From: David Laight @ 2018-03-21 14:35 UTC
To: 'Sinan Kaya', Oliver
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.

There are places where you don't even need a compiler barrier between every write.

I had horrid problems getting some ppc code (for a specific embedded SoC) optimised to have no extra barriers.
I ended up just writing through 'pointer to volatile' and adding an explicit 'eieio' between the block of writes and the status read.

No less painful was doing a byteswapping write to normal memory.

	David
* Re: RFC on writel and writel_relaxed
From: Sinan Kaya @ 2018-03-21 15:04 UTC
To: David Laight, Oliver
Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/21/2018 9:35 AM, David Laight wrote:
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>
> There are places where you don't even need a compile barrier between
> every write.
>
> I had horrid problems getting some ppc code (for a specific embedded SoC)
> optimised to have no extra barriers.
> I ended up just writing through 'pointer to volatile' and adding an
> explicit 'eieio' between the block of writes and status read.
>
> No less painful was doing a byteswapping write to normal memory.

If the architecture is reordering writes to the peripheral, then removing the compiler barrier can break multi-arch drivers. The barriers document clearly states that the device needs to observe writes in order.

Though for special cases like the one you mentioned, you can certainly do this:

    wmb()
    __raw_write/pointer access
    __raw_write/pointer access
    __raw_write/pointer access
    /* flush everything */
    mmiowb()
    __raw_write/pointer access

There would be no ordering guarantee between the wmb() and mmiowb().

This can only be done for known code and known hardware. I don't believe this applies to multi-arch drivers.

>
> David
>

--
Sinan Kaya
* Re: RFC on writel and writel_relaxed
From: Oliver @ 2018-03-22 5:24 UTC
To: David Laight
Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>
> There are places where you don't even need a compile barrier between
> every write.
>
> I had horrid problems getting some ppc code (for a specific embedded SoC)
> optimised to have no extra barriers.
> I ended up just writing through 'pointer to volatile' and adding an
> explicit 'eieio' between the block of writes and status read.

This is what you are supposed to do. For accesses to MMIO (cache inhibited + guarded) storage the Power ISA guarantees that load-load and store-store pairs of accesses will always occur in program order, but there's no implicit ordering between load-store or store-load pairs. In those cases you need an explicit eieio barrier between the two accesses.

At the HW level you can think of the CPU as having separate queues for MMIO loads and stores. Accesses will be added to the respective queue in program order, but there's no synchronisation between the two queues. If the CPU is doing write combining it's easy to imagine the whole store queue being emptied in one big gulp before the load queue is even touched.

> No less painful was doing a byteswapping write to normal memory.

What was the problem? The reverse indexed load/store instructions are a little awkward to use, but they work...

>
> David
>
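The pattern discussed above (a block of stores through a pointer to volatile, one eieio, then the status read) might look roughly like the sketch below. The device layout, the function names and the non-PowerPC fallback are assumptions made for illustration; only the eieio instruction and the store-store versus store-load rule come from the thread. The inline asm uses GCC/Clang syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Orders earlier MMIO stores against later MMIO loads. */
static inline void mmio_store_load_barrier(void)
{
#if defined(__powerpc__) || defined(__powerpc64__)
	__asm__ __volatile__("eieio" ::: "memory");
#else
	/* Assumption: on other hosts this model only stops the compiler. */
	__asm__ __volatile__("" ::: "memory");
#endif
}

/* Hypothetical device: three command registers and one status register. */
struct model_dev {
	volatile uint32_t cmd[3];
	volatile uint32_t status;
};

/* Writes up to three command words, then reads the status register. */
static uint32_t program_then_poll(struct model_dev *dev,
				  const uint32_t *vals, size_t n)
{
	size_t i;

	/* Store-store pairs: no barrier between these, per the rule above. */
	for (i = 0; i < n && i < 3; i++)
		dev->cmd[i] = vals[i];

	/* Store-load pair: this is where the explicit barrier is needed. */
	mmio_store_load_barrier();
	return dev->status;
}

/* Usage check against an ordinary struct standing in for device memory. */
static void check_program_then_poll(void)
{
	struct model_dev dev = { .cmd = { 0, 0, 0 }, .status = 0xa5 };
	const uint32_t vals[3] = { 1, 2, 3 };

	assert(program_then_poll(&dev, vals, 3) == 0xa5);
	assert(dev.cmd[0] == 1 && dev.cmd[1] == 2 && dev.cmd[2] == 3);
}
```

Because the struct here is ordinary cacheable memory, the check only exercises the control flow; the ordering guarantees themselves apply to cache-inhibited, guarded mappings on real hardware.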
* Re: RFC on writel and writel_relaxed
From: Gabriel Paubert @ 2018-03-22 8:20 UTC
To: Oliver
Cc: Sinan Kaya, linux-rdma, David Laight, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> in order.
> >
> > There are places where you don't even need a compile barrier between
> > every write.
> >
> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> > optimised to have no extra barriers.
> > I ended up just writing through 'pointer to volatile' and adding an
> > explicit 'eieio' between the block of writes and status read.
>
> This is what you are supposed to do. For accesses to MMIO (cache
> inhibited + guarded) storage the Power ISA guarantees that load-load
> and store-store pairs of accesses will always occur in program order,
> but there's no implicit ordering between load-store or store-load

And even for load-store pairs, eieio is not always necessary: in the important case of reading and writing the same address, for example when modifying bits in a control register, it can be omitted.

Also, loads will typically be moved ahead of stores, but not the other way around, so in practice you won't notice a missing eieio in this case. This does not mean you should not insert it.

> pairs. In those cases you need an explicit eieio barrier between the
> two accesses. At the HW level you can think of the CPU as having
> separate queues for MMIO loads and stores. Accesses will be added to
> the respective queue in program order, but there's no synchronisation
> between the two queues. If the CPU is doing write combining it's easy
> to imagine the whole store queue being emptied in one big gulp before
> the load queue is even touched.

Is write combining allowed on guarded storage?

<Looking at docs>
From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":

"No combining occurs if the storage is also Guarded"

	Gabriel
* Re: RFC on writel and writel_relaxed
From: Oliver @ 2018-03-22 9:25 UTC
To: Gabriel Paubert
Cc: Sinan Kaya, linux-rdma, David Laight, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
>> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
>> >> x86 has compiler barrier inside the relaxed() API so that code does not
>> >> get reordered. ARM64 architecturally guarantees device writes to be observed
>> >> in order.
>> >
>> > There are places where you don't even need a compile barrier between
>> > every write.
>> >
>> > I had horrid problems getting some ppc code (for a specific embedded SoC)
>> > optimised to have no extra barriers.
>> > I ended up just writing through 'pointer to volatile' and adding an
>> > explicit 'eieio' between the block of writes and status read.
>>
>> This is what you are supposed to do. For accesses to MMIO (cache
>> inhibited + guarded) storage the Power ISA guarantees that load-load
>> and store-store pairs of accesses will always occur in program order,
>> but there's no implicit ordering between load-store or store-load
>
> And even for load store, eieio is not always necessary, in the important
> case of reading and writing to the same address, when modifying bits in
> a control register for example.
>
> Typically also loads will be moved ahead of stores, but not the other
> way around, so in practice you won't notice a missed eieio in this case.
> This does not mean you should not insert it.

Yep, but it doesn't really help us here. The generic accessors need to cope with the general case.

>> pairs. In those cases you need an explicit eieio barrier between the
>> two accesses. At the HW level you can think of the CPU as having
>> separate queues for MMIO loads and stores. Accesses will be added to
>> the respective queue in program order, but there's no synchronisation
>> between the two queues. If the CPU is doing write combining it's easy
>> to imagine the whole store queue being emptied in one big gulp before
>> the load queue is even touched.
>
> Is write combining allowed on guarded storage?
>
> <Looking at docs>
> From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
>
> "No combining occurs if the storage is also Guarded"

Yeah, it's not allowed. That's what I get for handwaving examples ;)
* Re: RFC on writel and writel_relaxed
  2018-03-22 9:25 ` Oliver
@ 2018-03-22 11:25 ` Gabriel Paubert
  -1 siblings, 0 replies; 216+ messages in thread
From: Gabriel Paubert @ 2018-03-22 11:25 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, linux-rdma, David Laight, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 08:25:43PM +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 7:20 PM, Gabriel Paubert <paubert@iram.es> wrote:
> > On Thu, Mar 22, 2018 at 04:24:24PM +1100, Oliver wrote:
> >> On Thu, Mar 22, 2018 at 1:35 AM, David Laight <David.Laight@aculab.com> wrote:
> >> >> x86 has compiler barrier inside the relaxed() API so that code does not
> >> >> get reordered. ARM64 architecturally guarantees device writes to be observed
> >> >> in order.
> >> >
> >> > There are places where you don't even need a compile barrier between
> >> > every write.
> >> >
> >> > I had horrid problems getting some ppc code (for a specific embedded SoC)
> >> > optimised to have no extra barriers.
> >> > I ended up just writing through 'pointer to volatile' and adding an
> >> > explicit 'eieio' between the block of writes and status read.
> >>
> >> This is what you are supposed to do. For accesses to MMIO (cache
> >> inhibited + guarded) storage the Power ISA guarantees that load-load
> >> and store-store pairs of accesses will always occur in program order,
> >> but there's no implicit ordering between load-store or store-load
> >
> > And even for load store, eieio is not always necessary, in the important
> > case of reading and writing to the same address, when modifying bits in
> > a control register for example.
> >
> > Typically also loads will be moved ahead of stores, but not the other
> > way around, so in practice you won't notice a missed eieio in this case.
> > This does not mean you should not insert it.
>
> Yep, but it doesn't really help us here. The generic accessors need to cope
> with the general case.

A generic accessor for modifying fields in a device register might be
a useful addition to the current set. This is a fairly frequent
operation.

Actually I did add macros to do exactly this in drivers for our own
hardware here almost 20 years ago. I was fed up with writing
writel(readl(reg) & mask | value, reg), especially when reg was not
that simple (one device had over 100 registers). The macros obviously
guaranteed that both accesses would be to the same register, something
easy to get wrong with cut and paste.

>
> >> pairs. In those cases you need an explicit eieio barrier between the
> >> two accesses. At the HW level you can think of the CPU as having
> >> separate queues for MMIO loads and stores. Accesses will be added to
> >> the respective queue in program order, but there's no synchronisation
> >> between the two queues. If the CPU is doing write combining it's easy
> >> to imagine the whole store queue being emptied in one big gulp before
> >> the load queue is even touched.
> >
> > Is write combining allowed on guarded storage?
> >
> > <Looking at docs>
> > From PowerISA_V3.0.pdf, Book2, section 1.6.2 "Caching inhibited":
> >
> > "No combining occurs if the storage is also Guarded"
>
> Yeah it's not allowed. That's what I get for handwaving examples ;)

At least it means that, for cache-inhibited guarded storage, there is a
one-to-one correspondence between instructions and bus cycles. The only
issue left is ordering ;)

        Gabriel

^ permalink raw reply [flat|nested] 216+ messages in thread
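The read-modify-write accessor Gabriel describes could look roughly like the sketch below. It is only an illustration: `mmio_read32()`, `mmio_write32()` and `mmio_rmw32()` are hypothetical names standing in for `readl()`, `writel()` and the proposed helper, and they operate on a plain `volatile` variable so the snippet builds outside the kernel. The point is that the register address is named once, so both accesses necessarily hit the same register.

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins for readl()/writel(). A real driver would use
 * the kernel accessors on an ioremap()ed address instead.
 */
static inline uint32_t mmio_read32(const volatile uint32_t *reg)
{
    return *reg;
}

static inline void mmio_write32(uint32_t val, volatile uint32_t *reg)
{
    *reg = val;
}

/*
 * Clear the bits in `clear`, then set the bits in `set`, in one register.
 * The register is passed once, so the read and the write cannot end up
 * targeting different registers after a cut-and-paste mistake.
 */
static inline void mmio_rmw32(volatile uint32_t *reg,
                              uint32_t clear, uint32_t set)
{
    mmio_write32((mmio_read32(reg) & ~clear) | set, reg);
}
```

For example, `mmio_rmw32(&ctrl, 0xff, 0x05)` replaces the low byte of a hypothetical `ctrl` register with 0x05 and leaves the other bits alone. Note that this helper takes the bits to clear, whereas the expression in the message takes the bits to keep.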
* RE: RFC on writel and writel_relaxed
  2018-03-22 5:24 ` Oliver
@ 2018-03-22 10:37 ` David Laight
  2018-03-22 10:37 ` David Laight
  1 sibling, 0 replies; 216+ messages in thread
From: David Laight @ 2018-03-22 10:37 UTC (permalink / raw)
  To: 'Oliver'
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

From: Oliver
> Sent: 22 March 2018 05:24
...
> > No less painful was doing a byteswapping write to normal memory.
>
> What was the problem? The reverse indexed load/store instructions are
> a little awkward to use, but they work...

Finding something that would generate the right instruction without any
barriers.
ISTR writing my own asm pattern.

        David

^ permalink raw reply [flat|nested] 216+ messages in thread
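One barrier-free way to express the byte-swapping store to normal memory that David mentions is sketched below. This is not his asm pattern. `store_swapped32()` is a hypothetical helper, and `__builtin_bswap32()` is a GCC/Clang builtin. Whether the compiler turns this into a single byte-reversed store instruction on powerpc depends on the compiler version and flags, so treat that as an assumption to check in the generated code.

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: store `val` to normal (cacheable) memory with its
 * bytes in the opposite order to the host's native order. No barrier is
 * emitted; this is a plain store of a swapped value.
 */
static inline void store_swapped32(void *dst, uint32_t val)
{
    uint32_t swapped = __builtin_bswap32(val);  /* GCC/Clang builtin */

    /* memcpy avoids alignment and aliasing problems on `dst` */
    memcpy(dst, &swapped, sizeof(swapped));
}
```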
* Re: RFC on writel and writel_relaxed
  2018-03-21 13:53 ` Sinan Kaya
@ 2018-03-22 4:24 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22 4:24 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> writel_relaxed() needs to have ordering guarantees with respect to the order
> device observes writes.

Correct.

> x86 has compiler barrier inside the relaxed() API so that code does not
> get reordered. ARM64 architecturally guarantees device writes to be observed
> in order.
>
> I was hoping that PPC could follow x86 and inject compiler barrier into the
> relaxed functions.
>
> BTW, I have no idea what compiler barrier does on PPC and if
>
> writel() == compiler barrier() + writel_relaxed()
>
> can be said.

No, it's not sufficient.

Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
PPC, it will just not give you a benefit today.

The main problem is that the semantics of writel/writel_relaxed (and
read versions) aren't very well defined in Linux esp. when it comes
to different memory types (NC, WC, ...).

I've been wanting to implement the relaxed accessors for a while but
was battling with this to try to also better support WC, and due to
other commitments, this somewhat fell through the cracks.

Two options I can think of:

 - Just make the _relaxed variants use an eieio instead of a sync, this
will effectively lift the ordering guarantee vs. cacheable storage (and
thus unlock) and might give a (small) performance improvement. However,
we still have the problem that on WC mappings, neither writel nor
writel_relaxed will effectively allow combining to happen (only raw
accesses will because on powerpc *all* barriers will break combining).

 - Make writel_relaxed() be a simple store without barriers, and
readl_relaxed() be "eieio, read, eieio", thus allowing write combining
to happen between successive writel_relaxed on WC space (no change on
normal NC space) while maintaining the ordering between relaxed reads
and writes. The flip side is a (slight) increased overhead of
readl_relaxed.

Cheers,
Ben.

^ permalink raw reply [flat|nested] 216+ messages in thread
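The second option above can be pictured with a small portable model. This is not powerpc code: the `model_*` functions and the trace buffer are hypothetical, and each `record(OP_EIEIO)` only marks the spot where a real implementation would issue an `eieio` instruction. The sketch just makes the resulting instruction sequence visible, in particular that nothing separates two successive relaxed writes.

```c
#include <stddef.h>
#include <stdint.h>

/* Operations the model can record, in the order they would be issued. */
enum op { OP_EIEIO, OP_LOAD, OP_STORE };

#define TRACE_MAX 16
static enum op trace[TRACE_MAX];
static size_t trace_len;

static void record(enum op o)
{
    if (trace_len < TRACE_MAX)
        trace[trace_len++] = o;
}

/* Option 2: a relaxed write is a bare store with no barrier around it. */
static void model_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
    record(OP_STORE);
    *reg = val;
}

/* Option 2: a relaxed read is bracketed by eieio on both sides. */
static uint32_t model_readl_relaxed(const volatile uint32_t *reg)
{
    uint32_t v;

    record(OP_EIEIO);   /* orders the load after earlier MMIO stores */
    record(OP_LOAD);
    v = *reg;
    record(OP_EIEIO);   /* orders later MMIO stores after the load */
    return v;
}
```

Two relaxed writes followed by a relaxed read produce the trace store, store, eieio, load, eieio. The back-to-back stores are what would leave room for combining on a WC mapping, while the read still pays for two barriers.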
* Re: RFC on writel and writel_relaxed
  2018-03-22 4:24 ` Benjamin Herrenschmidt
@ 2018-03-22 10:15 ` Oliver
  -1 siblings, 0 replies; 216+ messages in thread
From: Oliver @ 2018-03-22 10:15 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
>> writel_relaxed() needs to have ordering guarantees with respect to the order
>> device observes writes.
>
> Correct.
>
>> x86 has compiler barrier inside the relaxed() API so that code does not
>> get reordered. ARM64 architecturally guarantees device writes to be observed
>> in order.
>>
>> I was hoping that PPC could follow x86 and inject compiler barrier into the
>> relaxed functions.
>>
>> BTW, I have no idea what compiler barrier does on PPC and if
>>
>> writel() == compiler barrier() + writel_relaxed()
>>
>> can be said.
>
> No, it's not sufficient.
>
> Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> PPC, it will just not give you a benefit today.
>
> The main problem is that the semantics of writel/writel_relaxed (and
> read versions) aren't very well defined in Linux esp. when it comes
> to different memory types (NC, WC, ...).
>
> I've been wanting to implement the relaxed accessors for a while but
> was battling with this to try to also better support WC, and due to
> other commitments, this somewhat fell through the cracks.
>
> Two options I can think of:
>
>  - Just make the _relaxed variants use an eieio instead of a sync, this
> will effectively lift the ordering guarantee vs. cacheable storage (and
> thus unlock) and might give a (small) performance improvement.

Wouldn't we still have the unlock ordering due to the io_sync hack or
are you thinking we should remove that too for the relaxed version?

> However,
> we still have the problem that on WC mappings, neither writel nor
> writel_relaxed will effectively allow combining to happen (only raw
> accesses will because on powerpc *all* barriers will break combining).

Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
affect combining on non-guarded memory. Do most implementations apply
it to all CI accesses anyway?

>  - Make writel_relaxed() be a simple store without barriers, and
> readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> to happen between successive writel_relaxed on WC space (no change on
> normal NC space) while maintaining the ordering between relaxed reads
> and writes. The flip side is a (slight) increased overhead of
> readl_relaxed.

Are there many drivers that actually do writeX() on WC space?
memory-barriers.txt pretty much says that all bets are off and no
ordering guarantees can be assumed when using readX/writeX on
prefetchable IO memory. It seems sketchy enough to give me some pause,
but maybe it works fine elsewhere.

Oliver

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed
  2018-03-22 10:15 ` Oliver
@ 2018-03-22 13:52 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-22 13:52 UTC (permalink / raw)
  To: Oliver
  Cc: Sinan Kaya, linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On Thu, 2018-03-22 at 21:15 +1100, Oliver wrote:
> On Thu, Mar 22, 2018 at 3:24 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> > On Wed, 2018-03-21 at 08:53 -0500, Sinan Kaya wrote:
> > > writel_relaxed() needs to have ordering guarantees with respect to the order
> > > device observes writes.
> >
> > Correct.
> >
> > > x86 has compiler barrier inside the relaxed() API so that code does not
> > > get reordered. ARM64 architecturally guarantees device writes to be observed
> > > in order.
> > >
> > > I was hoping that PPC could follow x86 and inject compiler barrier into the
> > > relaxed functions.
> > >
> > > BTW, I have no idea what compiler barrier does on PPC and if
> > >
> > > writel() == compiler barrier() + writel_relaxed()
> > >
> > > can be said.
> >
> > No, it's not sufficient.

Just to clarify ... barrier() is just a compiler barrier, it means the
compiler will generate things in the order they are written. This isn't
sufficient on archs with an OO memory model, where an actual memory
barrier instruction needs to be emitted.

As for Oliver's comments...

> > Replacing wmb() + writel() with wmb() + writel_relaxed() will work on
> > PPC, it will just not give you a benefit today.
> >
> > The main problem is that the semantics of writel/writel_relaxed (and
> > read versions) aren't very well defined in Linux esp. when it comes
> > to different memory types (NC, WC, ...).
> >
> > I've been wanting to implement the relaxed accessors for a while but
> > was battling with this to try to also better support WC, and due to
> > other commitments, this somewhat fell through the cracks.
> >
> > Two options I can think of:
> >
> >  - Just make the _relaxed variants use an eieio instead of a sync, this
> > will effectively lift the ordering guarantee vs. cacheable storage (and
> > thus unlock) and might give a (small) performance improvement.
>
> Wouldn't we still have the unlock ordering due to the io_sync hack or
> are you thinking we should remove that too for the relaxed version?

Well, the documentation says we don't care about synchronization vs.
locks so we should probably remove it (then we need to make sure mmiowb
works, thus sets the flag).

> > However,
> > we still have the problem that on WC mappings, neither writel nor
> > writel_relaxed will effectively allow combining to happen (only raw
> > accesses will because on powerpc *all* barriers will break combining).
>
> Hmm, eieio is only architected to affect CI+G (and WT) so it shouldn't
> affect combining
> on non-guarded memory. Do most implementations apply it to all CI
> accesses anyway?

Yes, as far as I know all implementations will stop combining on *any*
barrier instruction.

> >  - Make writel_relaxed() be a simple store without barriers, and
> > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > to happen between successive writel_relaxed on WC space (no change on
> > normal NC space) while maintaining the ordering between relaxed reads
> > and writes. The flip side is a (slight) increased overhead of
> > readl_relaxed.
>
> Are there many drivers that actually do writeX() on WC space?
> memory-barriers.txt
> pretty much says that all bets are off and no ordering guarantees can be assumed
> when using readX/writeX on prefetchable IO memory. It seems sketchy enough to
> give me some pause, but maybe it works fine elsewhere.

I don't know whether any does it, but I want to provide a way for a
driver to somewhat reliably obtain write-combining semantics without
having to hand-code endian swaps and the other horrors involved with
using __raw_* accessors.

So my thinking is to define that the combination of WC + writel_relaxed
gives you that, which it does at least on x86 and ARM afaik.

Cheers,
Ben.

^ permalink raw reply [flat|nested] 216+ messages in thread
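The distinction Ben draws between a compiler barrier and a real memory barrier can be sketched in portable GCC/Clang C. The macro and function names below are hypothetical. The empty `asm` with a "memory" clobber is the usual way a compiler-only barrier is written for GCC, and `__sync_synchronize()` is a GCC/Clang builtin that requests a full hardware barrier. Which instruction the builtin becomes on a given CPU is left to the compiler.

```c
/*
 * Compiler-only barrier: emits no instruction. It only stops the compiler
 * from moving memory accesses across this point.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Full barrier: also asks the CPU not to reorder accesses across this
 * point, by emitting whatever fence instruction the target needs.
 */
#define hw_barrier() __sync_synchronize()

static int payload;
static int flag;

/*
 * The compiler keeps the two stores in program order, but a CPU with a
 * weakly ordered memory model may still let another observer see `flag`
 * change before `payload`.
 */
static void publish_compiler_only(int value)
{
    payload = value;
    compiler_barrier();
    flag = 1;
}

/* Neither the compiler nor the CPU may reorder the two stores here. */
static void publish_with_fence(int value)
{
    payload = value;
    hw_barrier();
    flag = 1;
}
```

Single-threaded, both functions behave the same. The difference only shows up to a second observer, such as another CPU or a device, which is why a compiler barrier alone cannot stand in for `writel()` on an architecture that reorders stores.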
* Re: RFC on writel and writel_relaxed
  2018-03-22 13:52 ` Benjamin Herrenschmidt
@ 2018-03-22 17:51 ` Sinan Kaya
  -1 siblings, 0 replies; 216+ messages in thread
From: Sinan Kaya @ 2018-03-22 17:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Oliver
  Cc: linux-rdma, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
>>> No, it's not sufficient.
> Just to clarify ... barrier() is just a compiler barrier, it means the
> compiler will generate things in the order they are written. This isn't
> sufficient on archs with an OO memory model, where an actual memory
> barrier instruction needs to be emitted.

Surprisingly, the ARM64 GCC compiler generates a write barrier as
opposed to preventing code reordering.

I was curious if this is an ARM only thing or not.

-- 
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed
  2018-03-22 17:51 ` Sinan Kaya
@ 2018-03-23 0:16 ` Benjamin Herrenschmidt
  -1 siblings, 0 replies; 216+ messages in thread
From: Benjamin Herrenschmidt @ 2018-03-23 0:16 UTC (permalink / raw)
  To: Sinan Kaya, Oliver
  Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon

On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote:
> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote:
> > > > No, it's not sufficient.
> >
> > Just to clarify ... barrier() is just a compiler barrier, it means the
> > compiler will generate things in the order they are written. This isn't
> > sufficient on archs with an OO memory model, where an actual memory
> > barrier instruction needs to be emitted.
>
> Surprisingly, ARM64 GCC compiler generates a write barrier as
> opposed to preventing code reordering.
>
> I was curious if this is an ARM only thing or not.

Are you sure of that? I thought it's the ARM implementation of writel
that had an explicit write barrier in it:

#define writel(v,c)	({ __iowmb(); writel_relaxed((v),(c)); })

And __iowmb() is

#define __iowmb()	wmb()

Note, I'm a bit dubious about this in ARM:

#define readl(c)	({ u32 __v = readl_relaxed(c); __iormb(); __v; })

Will, Marc, on powerpc, we put a sync *before* the read in readl etc...

The reasoning was there could be some DMA setup followed by a side
effect readl rather than a side effect writel to trigger a DMA. Granted
I wouldn't expect modern devices to be that stupid, but I have vague
memory of some devices back in the day having that sort of read ops.

In general, I thought the model offered by x86 and thus by Linux
readl/writel was full synchronization both before and after the MMIO,
vs either other MMIO or all other forms of ops (cacheable memory, locks
etc...).

Also, can't the above readl_relaxed leak out of a lock?

Cheers,
Ben.

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-23 0:16 ` Benjamin Herrenschmidt @ 2018-03-23 13:42 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-23 13:42 UTC (permalink / raw) To: Benjamin Herrenschmidt, Oliver Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote: >> On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote: >>>>> No, it's not sufficient. >>> >>> Just to clarify ... barrier() is just a compiler barrier, it means the >>> compiler will generate things in the order they are written. This isn't >>> sufficient on archs with an OO memory model, where an actual memory >>> barrier instruction needs to be emitted. >> >> Surprisingly, ARM64 GCC compiler generates a write barrier as >> opposed to preventing code reordering. >> >> I was curious if this is an ARM only thing or not. > > Are you sure of that? I thought it's the ARM implementation of writel > that had an explicit write barrier in it: Yes, I'm 100% sure. The answer is that both writel() and barrier() generate a write barrier instruction. I found this by searching the kernel disassembly for back-to-back "dsb st" instructions. > > #define writel(v,c) ({ __iowmb(); writel_relaxed((v),(c)); }) > > And __iowmb() is > > #define __iowmb() wmb() > > Note, I'm a bit dubious about this in ARM: > > #define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; }) > > Will, Marc, on powerpc, we put a sync *before* the read in readl etc... > > The reasoning was there could be some DMA setup followed by a side > effect readl rather than a side effect writel to trigger a DMA. Granted > I wouldn't expect modern devices to be that stupid, but I have a vague > memory of some devices back in the day having that sort of read op. 
> > In general, I thought the model offered by x86 and thus by Linux > readl/writel was full synchronization both before and after the MMIO, > vs either other MMIO or all other forms of ops (cacheable memory, locks > etc...). > > Also, can't the above readl_relaxed leak out of a lock? I think you are asking about PPC, correct? I read somewhere that the PPC implementation keeps track of MMIO accesses and has an implicit barrier inside the spin_unlock() code for such accesses. Isn't this true? > > Cheers, > Ben. > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-23 13:42 ` Sinan Kaya @ 2018-03-24 1:22 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-24 1:22 UTC (permalink / raw) To: Sinan Kaya, Oliver Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon On Fri, 2018-03-23 at 09:42 -0400, Sinan Kaya wrote: > On 3/22/2018 8:16 PM, Benjamin Herrenschmidt wrote: > > On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote: > > > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote: > > > > > > No, it's not sufficient. > > > > > > > > Just to clarify ... barrier() is just a compiler barrier, it means the > > > > compiler will generate things in the order they are written. This isn't > > > > sufficient on archs with an OO memory model, where an actual memory > > > > barrier instruction needs to be emitted. > > > > > > Surprisingly, ARM64 GCC compiler generates a write barrier as > > > opposed to preventing code reordering. > > > > > > I was curious if this is an ARM only thing or not. > > > > Are you sure of that? I thought it's the ARM implementation of writel > > that had an explicit write barrier in it: > > Yes, I'm 100% sure. The answer is that both writel() and barrier() generate > a write barrier instruction. I found this by searching the kernel disassembly > for back-to-back "dsb st" instructions. I'm not sure you are correct here. As I wrote below, the implementation of writel() contains an *explicit* memory barrier, which is completely different from barrier(): > > > #define writel(v,c) ({ __iowmb(); writel_relaxed((v),(c)); }) > > > > And __iowmb() is > > > > #define __iowmb() wmb() > > > > Note, I'm a bit dubious about this in ARM: > > > > #define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; }) > > > > Will, Marc, on powerpc, we put a sync *before* the read in readl etc... 
> > > > The reasoning was there could be some DMA setup followed by a side > > effect readl rather than a side effect writel to trigger a DMA. Granted > > I wouldn't expect modern devices to be that stupid, but I have a vague > > memory of some devices back in the day having that sort of read op. > > > > In general, I thought the model offered by x86 and thus by Linux > > readl/writel was full synchronization both before and after the MMIO, > > vs either other MMIO or all other forms of ops (cacheable memory, locks > > etc...). > > > > Also, can't the above readl_relaxed leak out of a lock? > > I think you are asking about PPC, correct? No, I'm asking about ARM. > I read somewhere that the PPC implementation keeps track of MMIO accesses and > has an implicit barrier inside the spin_unlock() code for such accesses. > Isn't this true? Yes, I wrote that code :-) We keep track of writes and put an implicit stronger barrier in unlock on ppc64 because drivers never get mmiowb right. Cheers, Ben. > > > > Cheers, > > Ben. > > > > > > ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-24 1:22 ` Benjamin Herrenschmidt @ 2018-03-24 15:06 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-24 15:06 UTC (permalink / raw) To: Benjamin Herrenschmidt, Oliver Cc: linux-rdma, Marc Zyngier, linuxppc dev list, Will Deacon On 3/23/2018 9:22 PM, Benjamin Herrenschmidt wrote: >> Yes, I'm 100% sure. The answer is that both writel() and barrier() generate >> a write barrier instruction. I found this by searching the kernel disassembly >> for back-to-back "dsb st" instructions. > I'm not sure you are correct here. As I wrote below, the implementation > of writel() contains an *explicit* memory barrier, which is completely > different from barrier(): OK. I did some directed tests and I'm taking it back. barrier() only constrains compiler reordering. What got me confused was this sequence: wmb(); barrier(); writel(); I thought that the second barrier instruction was coming from barrier() but it was actually coming from writel(). -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
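The sequence that caused the confusion can be reproduced in a small user-space model that counts how many hardware barriers it would emit. The names wmb_model, writel_model, writel_relaxed_model and count_barriers_in_sequence are made up for this sketch, and the GCC/Clang fence builtin is only a stand-in for the real barrier instruction ("dsb st" in the disassembly mentioned above), not the kernel's wmb(). The barrier() macro here is the empty asm with a memory clobber that is described elsewhere in the thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Counts the hardware barriers the modelled sequence would emit. */
static int hw_barriers;

/* Compiler-only barrier: empty asm with a memory clobber, no instruction. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* Stand-in for wmb(): counted, and backed by a real fence (an assumption). */
static inline void wmb_model(void)
{
    hw_barriers++;
    __atomic_thread_fence(__ATOMIC_SEQ_CST);
}

/* Plain store, no barrier. */
static inline void writel_relaxed_model(uint32_t v, volatile uint32_t *c)
{
    assert(c != NULL);
    *c = v;
}

/* Mirrors the quoted macro: barrier first, then the relaxed store. */
static inline void writel_model(uint32_t v, volatile uint32_t *c)
{
    wmb_model();
    writel_relaxed_model(v, c);
}

/* The sequence from the mail: wmb(); barrier(); writel(); */
static int count_barriers_in_sequence(volatile uint32_t *reg)
{
    hw_barriers = 0;
    wmb_model();
    barrier();
    writel_model(1, reg);
    return hw_barriers;
}
```

The count comes out as two: one from the explicit wmb and one from inside the writel model. barrier() contributes nothing, which matches the correction in the message above.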
* Re: RFC on writel and writel_relaxed 2018-03-23 0:16 ` Benjamin Herrenschmidt @ 2018-03-26 11:44 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-26 11:44 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Sinan Kaya, linuxppc dev list, Oliver, Marc Zyngier, linux-rdma Hi Ben, I don't seem to have the beginning of this thread, so please bounce it over if you'd like me to look at it! On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote: > > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote: > > > > > No, it's not sufficient. > > > > > > Just to clarify ... barrier() is just a compiler barrier, it means the > > > compiler will generate things in the order they are written. This isn't > > > sufficient on archs with an OO memory model, where an actual memory > > > barrier instruction needs to be emitted. > > > > Surprisingly, ARM64 GCC compiler generates a write barrier as > > opposed to preventing code reordering. In context, this looks like a misunderstanding somewhere. barrier() is a compiler barrier for us just like everybody else and we use the generic implementation with the empty asm + memory clobber. > > I was curious if this is an ARM only thing or not. > > Are you sure of that? I thought it's the ARM implementation of writel > that had an explicit write barrier in it: > > #define writel(v,c) ({ __iowmb(); writel_relaxed((v),(c)); }) > > And __iowmb() is > > #define __iowmb() wmb() > > Note, I'm a bit dubious about this in ARM: > > #define readl(c) ({ u32 __v = readl_relaxed(c); __iormb(); __v; }) > > Will, Marc, on powerpc, we put a sync *before* the read in readl etc... > > The reasoning was there could be some DMA setup followed by a side > effect readl rather than a side effect writel to trigger a DMA. 
Granted > I wouldn't expect modern devices to be that stupid, but I have a vague > memory of some devices back in the day having that sort of read op. The reason we have it afterwards is for something like: while (!(readl(&status_register) & DMA_DONE)) ; data = *dma_buffer; to ensure that we don't read stale data from the buffer. You might also need this for systems with spurious/early IRQ delivery for DMA completion. You'd have to throw in an explicit mb() if you wanted to order a prior writel before the side effects of a later readl. > In general, I thought the model offered by x86 and thus by Linux > readl/writel was full synchronization both before and after the MMIO, > vs either other MMIO or all other forms of ops (cacheable memory, locks > etc...). > > Also, can't the above readl_relaxed leak out of a lock? No, it's ordered with respect to the release store to the lockword but that doesn't mean that an unlock does anything like ensure that the read has been satisfied (in particular, for your scenario above where it has side effects then unlocking the lock doesn't guarantee that they've occurred). Will ^ permalink raw reply [flat|nested] 216+ messages in thread
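Will's polling pattern can be written out as a small single-threaded model. Everything here is hypothetical: struct fake_dev stands in for a device whose status register flips to DMA_DONE after a few polls, and the acquire fence (a GCC/Clang builtin) only marks the position where __iormb() sits in the ARM readl() macro. In a single thread the fence has no observable effect; the sketch shows the shape of the code, not the hardware behaviour.

```c
#include <assert.h>
#include <stdint.h>

#define DMA_DONE 0x1u

/* Hypothetical device model: status flips to DMA_DONE after N polls. */
struct fake_dev {
    uint32_t status;
    uint32_t dma_buffer;
    int polls_until_done;
};

/* Relaxed read of the status register; also advances the fake device. */
static uint32_t readl_relaxed_model(struct fake_dev *d)
{
    if (d->polls_until_done > 0 && --d->polls_until_done == 0) {
        d->dma_buffer = 0xabcd;   /* the "device" lands its data ... */
        d->status |= DMA_DONE;    /* ... and then signals completion */
    }
    return *(volatile uint32_t *)&d->status;
}

/* Read followed by a barrier, mirroring the quoted ARM readl() macro. */
static uint32_t readl_model(struct fake_dev *d)
{
    uint32_t v = readl_relaxed_model(d);
    __atomic_thread_fence(__ATOMIC_ACQUIRE);   /* stands in for __iormb() */
    return v;
}

/* Poll for completion, then read the buffer the device filled. */
static uint32_t wait_and_fetch(struct fake_dev *d)
{
    /* Guard against spinning forever on a device that will never finish. */
    assert(d->polls_until_done > 0 || (d->status & DMA_DONE));

    while (!(readl_model(d) & DMA_DONE))
        ;
    return d->dma_buffer;
}
```

The barrier after the status read is what keeps the later load of the DMA buffer from being satisfied before the status read that observed DMA_DONE, which is the stale-data case described in the message.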
* Re: RFC on writel and writel_relaxed 2018-03-26 11:44 ` Will Deacon @ 2018-03-26 12:11 ` okaya -1 siblings, 0 replies; 216+ messages in thread From: okaya @ 2018-03-26 12:11 UTC (permalink / raw) To: Will Deacon; +Cc: linux-rdma, Oliver, linuxppc dev list, Marc Zyngier On 2018-03-26 07:44, Will Deacon wrote: > Hi Ben, > > I don't seem to have the beginning of this thread, so please bounce it > over > if you'd like me to look at it! > https://www.spinics.net/lists/linux-rdma/msg62570.html https://www.spinics.net/lists/linux-rdma/index.html#62666 > On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote: >> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote: >> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote: >> > > > > No, it's not sufficient. >> > > >> > > Just to clarify ... barrier() is just a compiler barrier, it means the >> > > compiler will generate things in the order they are written. This isn't >> > > sufficient on archs with an OO memory model, where an actual memory >> > > barrier instruction needs to be emitted. >> > >> > Surprisingly, ARM64 GCC compiler generates a write barrier as >> > opposed to preventing code reordering. > > In context, this looks like a misunderstanding somewhere. barrier() is > a > compiler barrier for us just like everybody else and we use the generic > implementation with the empty asm + memory clobber. > True, I clarified it this weekend: https://www.spinics.net/lists/linux-rdma/msg62788.html ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 12:11 ` okaya @ 2018-03-26 12:42 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-26 12:42 UTC (permalink / raw) To: Will Deacon; +Cc: linux-rdma, Oliver, linuxppc dev list, Marc Zyngier On 3/26/2018 8:11 AM, okaya@codeaurora.org wrote: > On 2018-03-26 07:44, Will Deacon wrote: >> Hi Ben, >> >> I don't seem to have the beginning of this thread, so please bounce it over >> if you'd like me to look at it! >> > > https://www.spinics.net/lists/linux-rdma/msg62570.html > > https://www.spinics.net/lists/linux-rdma/index.html#62666 > To add some more details on why we are looking at this now: I posted several patches last week to remove duplicate barriers on ARM while trying to make the code friendly to other architectures. https://www.spinics.net/lists/netdev/msg491842.html https://www.spinics.net/lists/linux-rdma/msg62434.html https://www.spinics.net/lists/arm-kernel/msg642336.html The conversation on this thread is interesting. https://patchwork.kernel.org/patch/10288987/ 1. I tried to replace wmb()+writel() with wmb()+writel_relaxed(). 2. writel_relaxed() is equal to writel() at this moment for PPC. 3. Chelsio developers wanted to pull it in the wmb()+__raw_writel() direction to take advantage of the same optimization for PPC. 4. Dave informed us that the behavior of __raw_writel() is not identical on all architectures. 5. We decided to go back to PPC and ask to implement writel_relaxed() instead of coming up with a writel_really_relaxed() API. > >> On Fri, Mar 23, 2018 at 11:16:08AM +1100, Benjamin Herrenschmidt wrote: >>> On Thu, 2018-03-22 at 12:51 -0500, Sinan Kaya wrote: >>> > On 3/22/2018 8:52 AM, Benjamin Herrenschmidt wrote: >>> > > > > No, it's not sufficient. >>> > > >>> > > Just to clarify ... barrier() is just a compiler barrier, it means the >>> > > compiler will generate things in the order they are written. 
This isn't >>> > > sufficient on archs with an OO memory model, where an actual memory >>> > > barrier instruction needs to be emitted. >>> > >>> > Surprisingly, ARM64 GCC compiler generates a write barrier as >>> > opposed to preventing code reordering. >> >> In context, this looks like a misunderstanding somewhere. barrier() is a >> compiler barrier for us just like everybody else and we use the generic >> implementation with the empty asm + memory clobber. >> > > True, I clarified it this weekend > > https://www.spinics.net/lists/linux-rdma/msg62788.html > > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
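One reason points 3 to 5 matter is byte order: in the generic kernel accessors, writel() and writel_relaxed() store the value little-endian, while __raw_writel() stores it in native byte order with no swap. The user-space sketch below models only that difference. The *_model names are hypothetical, the byte-order test relies on GCC/Clang predefined macros (a host that does not define them is assumed little-endian), and per-architecture quirks of __raw_writel() such as the ones Dave pointed out are not modelled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of cpu_to_le32(): swap only on big-endian hosts. */
static uint32_t cpu_to_le32_model(uint32_t v)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(v);
#else
    return v;
#endif
}

/* __raw_writel() model: native-endian store, no barrier, no byte swap. */
static void raw_writel_model(uint32_t v, void *reg)
{
    assert(reg != NULL);
    memcpy(reg, &v, sizeof(v));
}

/* writel_relaxed() model: little-endian store, still no barrier. */
static void writel_relaxed_model(uint32_t v, void *reg)
{
    raw_writel_model(cpu_to_le32_model(v), reg);
}
```

With the relaxed model the register bytes come out as 44 33 22 11 for the value 0x11223344 on any host, whereas a driver calling the raw model directly would have to add the swap itself on big-endian machines. That is the #ifdef BIG_ENDIAN ugliness the RFC is trying to avoid.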
* Re: RFC on writel and writel_relaxed 2018-03-22 13:52 ` Benjamin Herrenschmidt @ 2018-03-23 16:35 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-23 16:35 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver, linux-rdma On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote: > > > - Make writel_relaxed() be a simple store without barriers, and > > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining > > > to happen between successive writel_relaxed on WC space (no change on > > > normal NC space) while maintaining the ordering between relaxed reads > > > and writes. The flip side is a (slight) increased overhead of > > > readl_relaxed. > > > > Are there many drivers that actually do writeX() on WC space? > > memory-barriers.txt > > pretty much says that all bets are off and no ordering guarantees can be assumed > > when using readX/writeX on prefetchable IO memory. It seems sketchy enough to > > give me some pause, but maybe it works fine elsewhere. > > I don't know whether any does it, but I want to provide a way for a > driver to somewhat reliably obtain write combine semantics without > having to hand code endian swap and other horrors involved with using > __raw_* accessors. Many of the drivers in drivers/infiniband work with write combining memory. The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as efficiently as possible, ideally in a single PCI-E TLP. A memcpy_to_wc primitive could probably cover these use cases, no need to redesign the IO accessors.. The WC memory is never read, so read/write order is not important to any infiniband driver. What is very important is keeping the WC behavior isolated within the spinlock. 
Write combining to the same address cannot be permitted in this pattern: writel(addr = 0); mmiowb(); spin_unlock(); spin_lock(); writel(addr = 0); The CPU must always generate two PCI-E TLPs to the device. This is a super performance critical operation for most drivers and directly impacts network performance. Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
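The requirement Jason states, that two writes to the same address separated by an unlock/lock pair must reach the device as two PCI-E TLPs, can be illustrated with a toy write-combining buffer. This is a deliberately simplified, hypothetical model (struct wc_model, wc_write, wc_flush and doorbell_twice are invented names); real WC buffers flush on many other conditions, and the flush here merely stands in for the barrier issued before spin_unlock().

```c
#include <assert.h>
#include <stdint.h>

#define WC_SLOTS 4

/* Toy write-combining buffer: pending writes plus a count of TLPs sent. */
struct wc_model {
    uintptr_t addr[WC_SLOTS];
    uint32_t  val[WC_SLOTS];
    int       pending;
    int       tlps;
};

/* Buffer a write; a second write to an already buffered address is merged. */
static void wc_write(struct wc_model *m, uintptr_t addr, uint32_t val)
{
    for (int i = 0; i < m->pending; i++) {
        if (m->addr[i] == addr) {
            m->val[i] = val;
            return;
        }
    }
    assert(m->pending < WC_SLOTS);
    m->addr[m->pending] = addr;
    m->val[m->pending] = val;
    m->pending++;
}

/* Stand-in for the barrier before spin_unlock(): drain the buffer. */
static void wc_flush(struct wc_model *m)
{
    m->tlps += m->pending;   /* one modelled TLP per pending entry */
    m->pending = 0;
}

/* Two doorbell writes to address 0, with or without a flush in between. */
static int doorbell_twice(int flush_between)
{
    struct wc_model m = {0};

    wc_write(&m, 0, 0);
    if (flush_between)
        wc_flush(&m);        /* first critical section ends here */
    wc_write(&m, 0, 0);
    wc_flush(&m);
    return m.tlps;
}
```

With the flush between the two critical sections the model sends two TLPs. Without it the second write is merged into the first and only one goes out, which is exactly the outcome the message says must never happen.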
* Re: RFC on writel and writel_relaxed 2018-03-23 16:35 ` Jason Gunthorpe @ 2018-03-24 1:23 ` Benjamin Herrenschmidt
From: Benjamin Herrenschmidt @ 2018-03-24 1:23 UTC
To: Jason Gunthorpe
Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Oliver, linux-rdma

On Fri, 2018-03-23 at 10:35 -0600, Jason Gunthorpe wrote:
> On Fri, Mar 23, 2018 at 12:52:02AM +1100, Benjamin Herrenschmidt wrote:
> > > > - Make writel_relaxed() be a simple store without barriers, and
> > > > readl_relaxed() be "eieio, read, eieio", thus allowing write combining
> > > > to happen between successive writel_relaxed on WC space (no change on
> > > > normal NC space) while maintaining the ordering between relaxed reads
> > > > and writes. The flip side is a (slight) increased overhead of
> > > > readl_relaxed.
> > >
> > > Are there many drivers that actually do writeX() on WC space?
> > > memory-barriers.txt pretty much says that all bets are off and no
> > > ordering guarantees can be assumed when using readX/writeX on
> > > prefetchable IO memory. It seems sketchy enough to
> > > give me some pause, but maybe it works fine elsewhere.
> >
> > I don't know whether any does it, but I want to provide a way for a
> > driver to somewhat reliably obtain write combine semantics without
> > having to hand code endian swap and other horrors involved with using
> > __raw_* accessors.
>
> Many of the drivers in drivers/infiniband work with write combining
> memory.
>
> The usual pattern is a desire to push 32 or 64 bytes to the WC BAR as
> efficiently as possible, ideally in a single PCI-E TLP.
>
> A memcpy_to_wc primitive could probably cover these use cases, no need
> to redesign the IO accessors..
>
> The WC memory is never read, so read/write order is not important to
> any infiniband driver.
>
> What is very important is keeping the WC behavior isolated within the
> spinlock. WC to the same addresses cannot be permitted in this pattern:
>
>     writel(addr = 0);
>     mmiowb();
>     spin_unlock();
>
>     spin_lock();
>     writel(addr = 0);
>
> The CPU must always generate two PCI-E TLPs to the device.

On powerpc you'll never get write combining with writel. So that at least is covered.

> This is a super performance critical operation for most drivers and
> directly impacts network performance.
>
> Jason
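The "hand code endian swap" cost mentioned in the quoted text can be illustrated with a small sketch. The assumption here is that the device register is little-endian, which is the usual convention for PCI and is what writel() provides on every host, while a __raw_* style store uses host byte order. `put_le32()` and `get_le32()` are hypothetical helper names for illustration, not kernel functions.

```c
#include <stdint.h>

/*
 * Hypothetical helper: store val into a 4-byte register image as
 * little-endian bytes, whatever the byte order of the host CPU.
 * A raw native-order store would produce the same bytes only on a
 * little-endian host.
 */
static inline void put_le32(volatile uint8_t *reg, uint32_t val)
{
    reg[0] = (uint8_t)(val & 0xff);
    reg[1] = (uint8_t)((val >> 8) & 0xff);
    reg[2] = (uint8_t)((val >> 16) & 0xff);
    reg[3] = (uint8_t)((val >> 24) & 0xff);
}

/* Hypothetical helper: read the little-endian register image back. */
static inline uint32_t get_le32(const volatile uint8_t *reg)
{
    return (uint32_t)reg[0] |
           ((uint32_t)reg[1] << 8) |
           ((uint32_t)reg[2] << 16) |
           ((uint32_t)reg[3] << 24);
}
```

A driver that bypasses writel() has to carry this kind of conversion itself on big-endian hosts, which is the per-architecture special casing the thread is trying to avoid.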
* RE: RFC on writel and writel_relaxed 2018-03-24 1:23 ` Benjamin Herrenschmidt @ 2018-03-26 11:08 ` David Laight
From: David Laight @ 2018-03-26 11:08 UTC
To: 'Benjamin Herrenschmidt', Jason Gunthorpe
Cc: Sinan Kaya, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

> > This is a super performance critical operation for most drivers and
> > directly impacts network performance.

Perhaps there ought to be writel_nobarrier() (etc) that never contain any barriers at all.
This might mean that they are always just the memory operation, but it would make it more obvious what the driver was doing.

The driver would then be explicitly responsible for all the rmb(), wmb() and mmiowb() (etc).
Performance critical paths could then avoid all the extra barriers.

David
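What the proposal would look like from the driver side can be sketched in userspace C. All names are hypothetical: `model_writel_nobarrier()` stands in for the suggested accessor, `model_wmb()` is only a rough approximation of the kernel's wmb() built on a C11 fence, and `struct model_regs` is an invented register block backed by ordinary memory.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Invented register block for illustration; plain memory, not real MMIO. */
struct model_regs {
    volatile uint32_t addr_lo;
    volatile uint32_t addr_hi;
    volatile uint32_t len;
    volatile uint32_t go;
};

/* Hypothetical accessor: just the store, with no barrier of any kind. */
static inline void model_writel_nobarrier(volatile uint32_t *reg, uint32_t val)
{
    *reg = val;
}

/* Rough userspace stand-in for wmb(); the real barrier is arch-specific. */
static inline void model_wmb(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* The driver, not the accessor, decides where the single barrier goes. */
static inline void start_transfer(struct model_regs *r, uint64_t addr, uint32_t len)
{
    model_writel_nobarrier(&r->addr_lo, (uint32_t)addr);
    model_writel_nobarrier(&r->addr_hi, (uint32_t)(addr >> 32));
    model_writel_nobarrier(&r->len, len);
    model_wmb();                          /* one barrier before the trigger */
    model_writel_nobarrier(&r->go, 1);
}
```

The point of the design is visible in `start_transfer()`: four stores pay for one barrier instead of four, at the price of making the driver author responsible for getting the placement right.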
* Re: RFC on writel and writel_relaxed 2018-03-26 11:08 ` David Laight @ 2018-03-26 16:54 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2018-03-26 16:54 UTC
To: David Laight
Cc: Sinan Kaya, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma

On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> > > This is a super performance critical operation for most drivers and
> > > directly impacts network performance.
>
> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> any barriers at all.
> This might mean that they are always just the memory operation,
> but it would make it more obvious what the driver was doing.

I think that is what writel_relaxed is supposed to be.

The only restriction it has is that the writes to a single device using UC memory must be kept in program order..

Jason
* Re: RFC on writel and writel_relaxed 2018-03-26 16:54 ` Jason Gunthorpe @ 2018-03-26 19:44 ` Arnd Bergmann
From: Arnd Bergmann @ 2018-03-26 19:44 UTC
To: Jason Gunthorpe
Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> > > This is a super performance critical operation for most drivers and
>> > > directly impacts network performance.
>>
>> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> any barriers at all.
>> This might mean that they are always just the memory operation,
>> but it would make it more obvious what the driver was doing.
>
> I think that is what writel_relaxed is supposed to be.
>
> The only restriction it has is that the writes to a single device
> using UC memory must be kept in program order..

Not sure about whether we have ever defined what happens to writel_relaxed() on WC memory though: On ARM, we disallow the compiler to combine writes, but the CPU still might.

It's also not entirely clear to me what we want writel() inside a spinlock to mean: should the spinlock guarantee that two writel() calls on different CPUs that are protected by spinlocks are serialized by those locks, or not?

Arnd
* Re: RFC on writel and writel_relaxed 2018-03-26 19:44 ` Arnd Bergmann @ 2018-03-26 20:25 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2018-03-26 20:25 UTC
To: Arnd Bergmann
Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> > > This is a super performance critical operation for most drivers and
> >> > > directly impacts network performance.
> >>
> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> any barriers at all.
> >> This might mean that they are always just the memory operation,
> >> but it would make it more obvious what the driver was doing.
> >
> > I think that is what writel_relaxed is supposed to be.
> >
> > The only restriction it has is that the writes to a single device
> > using UC memory must be kept in program order..
>
> Not sure about whether we have ever defined what happens to
> writel_relaxed() on WC memory though: On ARM, we disallow
> the compiler to combine writes, but the CPU still might.

If the driver uses WC memory then I think it should not expect anything in terms of how writes map to TLPs other than nothing combines across mmiowb() and mmiowb() is fully globally ordered when enclosed in a spinlock.

The entire point of using WC memory is usually to get combining :) If the driver doesn't want that then it should map UC..

> It's also not entirely clear to me what we want writel() inside a
> spinlock to mean: should the spinlock guarantee that two writel()
> calls on different CPUs that are protected by spinlocks are
> serialized by those locks, or not?

Yes for writel, I think that is already defined by the barriers document.

The same document says that _relaxed() does not give that guarantee.

The LWN article on this went into some depth on the interaction with spinlocks.

As far as I can see, containment in a spinlock seems to be the only difference between writel and writel_relaxed..

Jason
* Re: RFC on writel and writel_relaxed 2018-03-26 20:25 ` Jason Gunthorpe @ 2018-03-26 20:43 ` Arnd Bergmann
From: Arnd Bergmann @ 2018-03-26 20:43 UTC
To: Jason Gunthorpe
Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
>> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
>> >> > > This is a super performance critical operation for most drivers and
>> >> > > directly impacts network performance.
>> >>
>> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
>> >> any barriers at all.
>> >> This might mean that they are always just the memory operation,
>> >> but it would make it more obvious what the driver was doing.
>> >
>> > I think that is what writel_relaxed is supposed to be.
>> >
>> > The only restriction it has is that the writes to a single device
>> > using UC memory must be kept in program order..
>>
>> Not sure about whether we have ever defined what happens to
>> writel_relaxed() on WC memory though: On ARM, we disallow
>> the compiler to combine writes, but the CPU still might.
>
> If the driver uses WC memory then I think it should not expect
> anything in terms of how writes map to TLPs other than nothing
> combines across mmiowb() and mmiowb() is fully globally ordered when
> enclosed in a spinlock.
>
> The entire point of using WC memory is usually to get combining :) If
> the driver doesn't want that then it should map UC..

Usually, WC memory is used with memcpy_toio() though, which by definition doesn't have any barriers between accesses, and is required to get the correct byte ordering on writes to memory buffers.

>> It's also not entirely clear to me what we want writel() inside a
>> spinlock to mean: should the spinlock guarantee that two writel()
>> calls on different CPUs that are protected by spinlocks are
>> serialized by those locks, or not?
>
> Yes for writel, I think that is already defined by the barriers
> document

Sorry, I meant writel_relaxed(), not writel()

> The same document says that _relaxed() does not give that guarantee.
>
> The LWN article on this went into some depth on the interaction with
> spinlocks.
>
> As far as I can see, containment in a spinlock seems to be the only
> difference between writel and writel_relaxed..

I was always puzzled by this: The intention of _relaxed() on ARM (where it originates) was to skip the barrier that serializes DMA with MMIO, not to skip the serialization between MMIO and locks.

I never fully understood the part about the locks, but from what I remember, ARM is still serialized without the barrier here, but dropping the barrier on powerpc writel_relaxed() would not serialize against locks or DMA.

Arnd
* Re: RFC on writel and writel_relaxed 2018-03-26 20:43 ` Arnd Bergmann @ 2018-03-26 21:09 ` Jason Gunthorpe
From: Jason Gunthorpe @ 2018-03-26 21:09 UTC
To: Arnd Bergmann
Cc: Sinan Kaya, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), David Laight, Oliver, linux-rdma

On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote:
> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote:
> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote:
> >> >> > > This is a super performance critical operation for most drivers and
> >> >> > > directly impacts network performance.
> >> >>
> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain
> >> >> any barriers at all.
> >> >> This might mean that they are always just the memory operation,
> >> >> but it would make it more obvious what the driver was doing.
> >> >
> >> > I think that is what writel_relaxed is supposed to be.
> >> >
> >> > The only restriction it has is that the writes to a single device
> >> > using UC memory must be kept in program order..
> >>
> >> Not sure about whether we have ever defined what happens to
> >> writel_relaxed() on WC memory though: On ARM, we disallow
> >> the compiler to combine writes, but the CPU still might.
> >
> > If the driver uses WC memory then I think it should not expect
> > anything in terms of how writes map to TLPs other than nothing
> > combines across mmiowb() and mmiowb() is fully globally ordered when
> > enclosed in a spinlock.
> >
> > The entire point of using WC memory is usually to get combining :) If
> > the driver doesn't want that then it should map UC..
>
> Usually, WC memory is used with memcpy_toio() though, which
> by definition doesn't have any barriers between accesses, and
> is required to get the correct byte ordering on writes to memory buffers.

memcpy_toio is too expensive to actually use for anything performance critical though. It is too pessimistic. What the drivers usually want is an unrolled block of 4 or 8 8-byte copies. No function calls, no branching. Everything is already known to be aligned.

Most of the drivers have an unrolled loop with writeq() or something to do it.

> > The same document says that _relaxed() does not give that guarantee.
> >
> > The LWN article on this went into some depth on the interaction with
> > spinlocks.
> >
> > As far as I can see, containment in a spinlock seems to be the only
> > difference between writel and writel_relaxed..
>
> I was always puzzled by this: The intention of _relaxed() on ARM
> (where it originates) was to skip the barrier that serializes DMA
> with MMIO, not to skip the serialization between MMIO and locks.

But that was never a requirement of writel(), Documentation/memory-barriers.txt gives an explicit example demanding the wmb() before writel() for ordering system memory against writel.

I actually have no idea why ARM had that barrier, I always assumed it was to give program ordering to the accesses and that _relaxed allowed re-ordering (the usual meaning of relaxed)..

But the barrier document makes it pretty clear that the only difference between the two is spinlock containment, and WillD wrote this text, so I believe it is accurate for ARM.

Very confusing.

> I never fully understood the part about the locks, but from what
> I remember, ARM is still serialized without the barrier here, but
> dropping the barrier on powerpc writel_relaxed() would not
> serialize against locks or DMA.

WC is usually the problem here.. I've been told it is necessary on ARM as well..

Jason
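The "block of 4 or 8 8-byte copies" described above might look roughly like the following. This is a sketch, not code from any driver: `copy64_unrolled()` is a hypothetical name, the destination is ordinary memory standing in for a write-combining BAR mapping, and a real driver would use an MMIO accessor and handle byte order rather than plain assignments.

```c
#include <stdint.h>

/*
 * Hypothetical helper: push one 64-byte block as eight straight-line
 * 8-byte stores. No loop, no branch, no function call per word.
 * Both pointers are assumed to be 8-byte aligned by the caller.
 */
static inline void copy64_unrolled(volatile uint64_t *dst, const uint64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
    dst[4] = src[4];
    dst[5] = src[5];
    dst[6] = src[6];
    dst[7] = src[7];
}
```

Because nothing separates the eight stores, hardware with a write-combining buffer is free to merge them into one burst. Whether it actually does is up to the CPU and the mapping type, which is the uncertainty this part of the thread keeps coming back to.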
* Re: RFC on writel and writel_relaxed 2018-03-26 21:09 ` Jason Gunthorpe @ 2018-03-26 21:30 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-26 21:30 UTC (permalink / raw) To: Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, Mar 26, 2018 at 11:09 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Mon, Mar 26, 2018 at 10:43:43PM +0200, Arnd Bergmann wrote: >> On Mon, Mar 26, 2018 at 10:25 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote: >> > On Mon, Mar 26, 2018 at 09:44:15PM +0200, Arnd Bergmann wrote: >> >> On Mon, Mar 26, 2018 at 6:54 PM, Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote: >> >> >> > > This is a super performance critical operation for most drivers and >> >> >> > > directly impacts network performance. >> >> >> >> >> >> Perhaps there ought to be writel_nobarrier() (etc) that never contain >> >> >> any barriers at all. >> >> >> This might mean that they are always just the memory operation, >> >> >> but it would make it more obvious what the driver was doing. >> >> > >> >> > I think that is what writel_relaxed is supposed to be. >> >> > >> >> > The only restriction it has is that the writes to a single device >> >> > using UC memory must be kept in program order.. >> >> >> >> Not sure about whether we have ever defined what happens to >> >> writel_relaxed() on WC memory though: On ARM, we disallow >> >> the compiler to combine writes, but the CPU still might. >> > >> > If the driver uses WC memory then I think it should not expect >> > anything in terms of how writes map to TLPs other than nothing >> > combines across mmiowb() and mmiowb() is fully globally ordered when >> > enclosed in a spinlock. >> > >> > The entire point of using WC memory is usually to get combining :) If >> > the driver doesn't want that then it should map UC.. 
>> >> Usually, WC memory is used with memcpy_toio() though, which >> by definition doesn't have any barriers between accesses, and >> is required to get the correct byte ordering on writes to memory buffers. > > memcpy_toio is too expensive to actually use for anything performance > though. It is too pessimistic. What the drivers usually want is a > unwound block of 4 or 8 8-byte copies. No function calls, no > branching. Everything is already known to be aligned. > > Most of the drivers have a unwound loop with writeq() or something to > do it. But isn't the writeq() barrier much more expensive than anything you'd do in function calls? >> > The same document says that _relaxed() does not give that guarentee. >> > >> > The lwn articule on this went into some depth on the interaction with >> > spinlocks. >> > >> > As far as I can see, containment in a spinlock seems to be the only >> > different between writel and writel_relaxed.. >> >> I was always puzzled by this: The intention of _relaxed() on ARM >> (where it originates) was to skip the barrier that serializes DMA >> with MMIO, not to skip the serialization between MMIO and locks. > > But that was never a requirement of writel(), > Documentation/memory-barriers.txt gives an explicit example demanding > the wmb() before writel() for ordering system memory against writel. Indeed, but it's in an example for when to use dma_wmb(), not wmb(). Adding Alexander Duyck to Cc, he added that section as part of 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and dma_wmb()"). Also adding the other people that were involved with that. > I actually have no idea why ARM had that barrier, I always assumed it > was to give program ordering to the accesses and that _relaxed allowed > re-ordering (the usual meaning of relaxed).. > > But the barrier document makes it pretty clear that the only > difference between the two is spinlock containment, and WillD wrote > this text, so I belive it is accurate for ARM. 
> > Very confusing. It does mention serialization with both DMA and locks in the section about readX_relaxed()/writeX_relaxed(). The part about DMA is very clear here, and I must have just forgotten the exact semantics with regards to spinlocks. I'm still not sure what prevents a writel() from leaking out the end of a spinlock section that doesn't happen with writel_relaxed(), since the barrier in writel() comes before the access, and the spin_unlock() shouldn't affect the external buses. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
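Illustrative sketch (not from the thread): the block of "4 or 8 8-byte copies" with "no function calls, no branching" that Jason describes might look roughly like the following. This is a user-space model under stated assumptions: demo_copy64() and demo_push_block() are hypothetical names, an ordinary buffer stands in for a WC-mapped BAR, and plain volatile stores stand in for whatever raw or relaxed 64-bit accessor a real driver would use.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: copy one 64-byte block as eight straight-line
 * 8-byte stores. There is no loop and no per-store barrier here; the
 * caller is assumed to guarantee 8-byte alignment of both pointers.
 */
static inline void demo_copy64(volatile uint64_t *dst, const uint64_t *src)
{
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
    dst[4] = src[4];
    dst[5] = src[5];
    dst[6] = src[6];
    dst[7] = src[7];
}

/*
 * Hypothetical wrapper: check the alignment assumption once, outside the
 * straight-line copy, then push the block.
 */
static void demo_push_block(volatile uint64_t *dst, const uint64_t *src)
{
    assert(((uintptr_t)dst % 8) == 0);
    assert(((uintptr_t)src % 8) == 0);
    demo_copy64(dst, src);
}
```

In this model any ordering against other accesses would have to be added once around the whole block rather than per store, which is the cost trade-off the writeq() question above is about.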
* Re: RFC on writel and writel_relaxed 2018-03-26 21:30 ` Arnd Bergmann @ 2018-03-26 21:46 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-26 21:46 UTC (permalink / raw) To: Arnd Bergmann, Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On 3/26/2018 5:30 PM, Arnd Bergmann wrote: >> But that was never a requirement of writel(), >> Documentation/memory-barriers.txt gives an explicit example demanding >> the wmb() before writel() for ordering system memory against writel. > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > Adding Alexander Duyck to Cc, he added that section as part of > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > dma_wmb()"). Also adding the other people that were involved with that. > ARM developers can get away with not including wmb() in their code and use writel() to observe memory writes due to implicit barriers. However, same code will not work on Intel. writel() has a compiler barrier in it for x86. wmb() has a sync operation in it for x86. Unless wmb() is called, PCIe device won't observe memory updates from the CPU. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 21:46 ` Sinan Kaya @ 2018-03-26 22:01 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 22:01 UTC (permalink / raw) To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > But that was never a requirement of writel(), > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > the wmb() before writel() for ordering system memory against writel. > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > Adding Alexander Duyck to Cc, he added that section as part of > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > dma_wmb()"). Also adding the other people that were involved with that. > > > > ARM developers can get away with not including wmb() in their code and use > writel() to observe memory writes due to implicit barriers. > > However, same code will not work on Intel. Wrong. It will. You do NOT need wmb between writes to memory and writel. > writel() has a compiler barrier in it for x86. > wmb() has a sync operation in it for x86. > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU. This is completely wrong. They will. Intel provides the necessary ordering guarantees without an explicit wmb. Otherwise almost all drivers out there are broken which I very much doubt :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:01 ` Benjamin Herrenschmidt @ 2018-03-26 22:08 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-26 22:08 UTC (permalink / raw) To: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe Cc: linux-rdma, Will Deacon, Alexander Duyck, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: >> On 3/26/2018 5:30 PM, Arnd Bergmann wrote: >>>> But that was never a requirement of writel(), >>>> Documentation/memory-barriers.txt gives an explicit example demanding >>>> the wmb() before writel() for ordering system memory against writel. >>> >>> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). >>> Adding Alexander Duyck to Cc, he added that section as part of >>> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and >>> dma_wmb()"). Also adding the other people that were involved with that. >>> >> >> ARM developers can get away with not including wmb() in their code and use >> writel() to observe memory writes due to implicit barriers. >> >> However, same code will not work on Intel. > > Wrong. It will. > > You do NOT need wmb between writes to memory and writel. If writel() provides such a guarantee, why do I see code sequences like wmb() writel() all over the place. > >> writel() has a compiler barrier in it for x86. >> wmb() has a sync operation in it for x86. >> >> Unless wmb() is called, PCIe device won't observe memory updates from the CPU. > > This is completely wrong. They will. Intel provides the necessary > ordering guarantees without an explicit wmb. > I'm still reserving my doubts here. I was told about an explicit wmb() requirement last week. > Otherwise almost all drivers out there are broken which I very much > doubt :-) > > Cheers, > Ben. > > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. 
as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:08 ` Sinan Kaya @ 2018-03-26 22:28 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 22:28 UTC (permalink / raw) To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe Cc: linux-rdma, Will Deacon, Alexander Duyck, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, 2018-03-26 at 18:08 -0400, Sinan Kaya wrote: > On 3/26/2018 6:01 PM, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > > > But that was never a requirement of writel(), > > > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > > > the wmb() before writel() for ordering system memory against writel. > > > > > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > > > Adding Alexander Duyck to Cc, he added that section as part of > > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > > > dma_wmb()"). Also adding the other people that were involved with that. > > > > > > > > > > ARM developers can get away with not including wmb() in their code and use > > > writel() to observe memory writes due to implicit barriers. > > > > > > However, same code will not work on Intel. > > > > Wrong. It will. > > > > You do NOT need wmb between writes to memory and writel. > > If writel() provides such a guarantee, why do I see code sequences like > > wmb() > writel() > > all over the place. Because it was badly documented and people didn't know what to do, or maybe the underlying mapping is WC? I don't know for sure but I can tell you Linus's opinion on the matter back in the day was very clear and that's why we implemented writel the way we did on powerpc. > > > > > writel() has a compiler barrier in it for x86. > > > wmb() has a sync operation in it for x86. 
> > > > > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU. > > > > This is completely wrong. They will. Intel provides the necessary > > ordering guarantees without an explicit wmb. > > > > I'm still reserving my doubts here. I was told about an explicit > wmb() requirement last week. By whom? > > Otherwise almost all drivers out there are broken which I very much > > doubt :-) > > > > Cheers, > > Ben. > > > > > > > > ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:01 ` Benjamin Herrenschmidt @ 2018-03-26 22:27 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-26 22:27 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > On 3/26/2018 5:30 PM, Arnd Bergmann wrote: > > > > But that was never a requirement of writel(), > > > > Documentation/memory-barriers.txt gives an explicit example demanding > > > > the wmb() before writel() for ordering system memory against writel. > > > > > > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > > > Adding Alexander Duyck to Cc, he added that section as part of > > > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > > > dma_wmb()"). Also adding the other people that were involved with that. > > > > > > > ARM developers can get away with not including wmb() in their code and use > > writel() to observe memory writes due to implicit barriers. > > > > However, same code will not work on Intel. > > Wrong. It will. > > You do NOT need wmb between writes to memory and writel. > > > writel() has a compiler barrier in it for x86. > > wmb() has a sync operation in it for x86. > > > > Unless wmb() is called, PCIe device won't observe memory updates from the CPU. > > This is completely wrong. They will. Intel provides the necessary > ordering guarantees without an explicit wmb. > > Otherwise almost all drivers out there are broken which I very much > doubt :-) But.. 
Sinan is right, you look anywhere in the driver tree and you find stuff like this: drivers/net/ethernet/intel/i40e/i40e_txrx.c /* Force memory writes to complete before letting h/w * know there are new descriptors to fetch. */ wmb(); It is *systemic* I even see patches adding wmb() based on actual observed memory corruption during testing on Intel: https://patchwork.kernel.org/patch/10177207/ So you think all of this is unnecessary and writel is totally strongly ordered, even on multi-socket Intel? Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
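Illustrative sketch (not from the thread): the "write descriptors, barrier, ring doorbell" sequence that the quoted i40e comment refers to can be modelled in user space roughly as below. Everything here is an assumption made for illustration: struct demo_desc, struct demo_ring and demo_post() are hypothetical, an atomic variable stands in for the MMIO tail register, a C11 release fence stands in for wmb(), and a relaxed atomic store stands in for the doorbell write. It shows the pattern only. It does not settle whether the explicit barrier is required in front of writel() on any given architecture, which is exactly what this thread is disputing.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical descriptor layout; not taken from any real driver. */
struct demo_desc {
    uint64_t addr;
    uint32_t len;
    uint32_t flags;
};

/* Hypothetical ring: descriptors in "system memory" plus a doorbell. */
struct demo_ring {
    struct demo_desc desc[4];
    _Atomic uint32_t doorbell;  /* stands in for the MMIO tail register */
};

/* Fill one descriptor, then publish the new tail through the doorbell. */
static uint32_t demo_post(struct demo_ring *ring, uint32_t tail,
                          uint64_t addr, uint32_t len)
{
    struct demo_desc *d = &ring->desc[tail % 4];

    d->addr = addr;
    d->len = len;
    d->flags = 1;

    /* Stand-in for wmb(): order the descriptor stores before the doorbell. */
    atomic_thread_fence(memory_order_release);

    /* Stand-in for the doorbell write, e.g. writel(tail + 1, tail_reg). */
    atomic_store_explicit(&ring->doorbell, tail + 1, memory_order_relaxed);

    return tail + 1;
}
```

Dropping the fence in this model corresponds to relying on the doorbell accessor itself to order the earlier memory writes, which is the behaviour Ben says writel() was meant to provide and Sinan and Jason are questioning.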
* Re: RFC on writel and writel_relaxed 2018-03-26 22:27 ` Jason Gunthorpe @ 2018-03-26 22:36 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 22:36 UTC (permalink / raw) To: Jason Gunthorpe Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > Otherwise almost all drivers out there are broken which I very much > > doubt :-) > > But.. Sinan is right, you look anywhere in the driver tree and you > find stuff like this: > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > /* Force memory writes to complete before letting h/w > * know there are new descriptors to fetch. > */ > wmb(); > > > It is *systemic* Yes, because they all copied e1000e :-) If you look at the comment in there, it does say it's only for weakly ordered archs such as ia64, and even then, probably predates Linus's strong statement on the matter. > I even see patches adding wmb() based on actual observed memory > corruption during testing on Intel: > > https://patchwork.kernel.org/patch/10177207/ > > So you think all of this is unnecessary and writel is totally strongly > ordered, even on multi-socket Intel? I don't know, it used to be the case, at least that's what drove us to define things the way we did. Maybe things changed, but if that's the case, nobody knows for sure, and we probably want to get Linus's POV on the matter. I know I still write drivers that do not add a wmb in that case because I expect things to work without it. If that has changed, we probably can relax some of the barriers in our implementations of writel on a number of architectures, but not before auditing a bunch more drivers to make sure they have the right wmb()s. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:36 ` Benjamin Herrenschmidt @ 2018-03-26 22:42 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 22:42 UTC (permalink / raw) To: Jason Gunthorpe Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, 2018-03-27 at 09:36 +1100, Benjamin Herrenschmidt wrote: > I don't know, it used to be the case, at least that's what drove us to > define things the way we did. > > Maybe things changed, but if that's the case, nobody knows for sure, > and we probably want to get Linus's POV on the matter. > > I know I still write drivers that do not add a wmb in that case because > I expect things to work without it. > > If that has changed, we probably can relax some of the barriers in our > implementations of writel on a number of architectures, but not before > auditing a bunch more drivers to make sure they have the right wmb()s. Note also that this was the entire point behind the definition of the _relaxed() accessors, to lift that specific ordering guarantee. If you now say that memory + writel requires a wmb() in between then you made writel be identical to writel_relaxed. You might notice that Documentation/driver-api/device-io.rst makes no mention of wmb() at all. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:36 ` Benjamin Herrenschmidt @ 2018-03-26 22:50 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-26 22:50 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > Otherwise almost all drivers out there are broken which I very much > > > doubt :-) > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > find stuff like this: > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > /* Force memory writes to complete before letting h/w > > * know there are new descriptors to fetch. > > */ > > wmb(); > > > > > > It is *systemic* > > Yes, because they all copied e1000e :-) If you look at the comment in > there, it does say it's only for weakly ordered archs such as ia64, and > even then, probably predates Linus strong statement on the matter. Hahah, sure I'll buy that.. But still, if this really is the case, a *strong* statement in barriers.txt to that effect (and not an example demanding the wmb()!) would be very helpful for those of us that have to review driver code! Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:50 ` Jason Gunthorpe @ 2018-03-26 23:59 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 23:59 UTC (permalink / raw) To: Jason Gunthorpe Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote: > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > > Otherwise almost all drivers out there are broken which I very much > > > > doubt :-) > > > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > > find stuff like this: > > > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > > > /* Force memory writes to complete before letting h/w > > > * know there are new descriptors to fetch. > > > */ > > > wmb(); > > > > > > > > > It is *systemic* > > > > Yes, because they all copied e1000e :-) If you look at the comment in > > there, it does say it's only for weakly ordered archs such as ia64, and > > even then, probably predates Linus strong statement on the matter. > > Hahah, sure I'll buy that.. > > But still, if this really is the case, a *strong* statement in > barriers.txt to that effect (and not an example demanding the wmb()!) > would be very helpful for those of us that have to review driver code! I agree, and that Mellanox bug you pointed me to seems to indicate that this may not even be true on x86 anymore ... I think we might need to revisit this properly... Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 23:59 ` Benjamin Herrenschmidt @ 2018-03-27 1:39 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-27 1:39 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 10:59:40AM +1100, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 16:50 -0600, Jason Gunthorpe wrote: > > On Tue, Mar 27, 2018 at 09:36:11AM +1100, Benjamin Herrenschmidt wrote: > > > On Mon, 2018-03-26 at 16:27 -0600, Jason Gunthorpe wrote: > > > > > Otherwise almost all drivers out there are broken which I very much > > > > > doubt :-) > > > > > > > > But.. Sinan is right, you look anywhere in the driver tree and you > > > > find stuff like this: > > > > > > > > drivers/net/ethernet/intel/i40e/i40e_txrx.c > > > > > > > > /* Force memory writes to complete before letting h/w > > > > * know there are new descriptors to fetch. > > > > */ > > > > wmb(); > > > > > > > > > > > > It is *systemic* > > > > > > Yes, because they all copied e1000e :-) If you look at the comment in > > > there, it does say it's only for weakly ordered archs such as ia64, and > > > even then, probably predates Linus strong statement on the matter. > > > > Hahah, sure I'll buy that.. > > > > But still, if this really is the case, a *strong* statement in > > barriers.txt to that effect (and not an example demanding the wmb()!) > > would be very helpful for those of us that have to review driver code! > > I agree, and that Mellanox bug you pointed me to seems to indicate that > this may not even be true on x86 anymore ... However, with bugs like that it is hard to know what is going on.. It could be a CPU bug instead. > I think we might need to revisit this properly... 
I would love to hear a definitive statement from Intel on what wmb(); writel(); does on x86. If writel is ordered, Sinan's patches are backwards: instead of using writel_relaxed, they should be eliminating the wmb(). But there is no way patches like that could go ahead until barriers.txt is updated. Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 22:27 ` Jason Gunthorpe @ 2018-03-27 7:56 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 7:56 UTC (permalink / raw) To: Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > I even see patches adding wmb() based on actual observed memory > corruption during testing on Intel: > > https://patchwork.kernel.org/patch/10177207/ > > So you think all of this is unnecessary and writel is totally strongly > ordered, even on multi-socket Intel? This example adds a wmb() between two writes to a coherent DMA area, it is definitely required there. I'm pretty sure I've never seen any bug reports pointing to a missing wmb() between memory and MMIO write accesses, but if you remember seeing them in the list, maybe you can look again for some evidence of something going wrong on x86 without it? Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 7:56 ` Arnd Bergmann @ 2018-03-27 8:56 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 8:56 UTC (permalink / raw) To: Arnd Bergmann, Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. Ah, you are right: I incorrectly assumed that the "prod_db" function did an MMIO write. So we do NOT have a counterexample where wmb is needed on x86, phew! :-) > I'm pretty sure I've never seen > any bug reports pointing to a missing wmb() between memory > and MMIO write accesses, but if you remember seeing them in the > list, maybe you can look again for some evidence of something going > wrong on x86 without it? The interesting thing is that we do seem to have a whole LOT of these spurious wmb before writel all over the tree. I suspect that is because of the incorrect recommendation in memory-barriers.txt. We should fix that. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 8:56 ` Benjamin Herrenschmidt @ 2018-03-27 9:44 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 9:44 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: >> >> I'm pretty sure I've never seen >> any bug reports pointing to a missing wmb() between memory >> and MMIO write accesses, but if you remember seeing them in the >> list, maybe you can look again for some evidence of something going >> wrong on x86 without it? > > The interesting thing is that we do seem to have a whole LOT of these > spurious wmb before writel all over the tree, I suspect because of > that incorrect recommendation in memory-barriers.txt. > > We should fix that. Maybe the problem is just that it's so counter-intuitive that we don't need that barrier in Linux, when the hardware does need one on some architectures. How about we define a barrier-type macro specifically for this purpose, something like wmb_before_mmio(), and have all architectures define it as an empty macro? That way, having correct code using wmb_before_mmio() will not trigger an incorrect review comment that leads to extra wmb(). ;-) Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
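The wmb_before_mmio() idea floated above might look roughly like the following. This is a userspace sketch under assumptions, not an existing kernel interface: the macro body, `model_writel()` and `queue_and_kick()` are hypothetical, and `model_writel()` simply assumes the "writel orders prior memory writes" semantics argued for in the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Hypothetical annotation macro: expands to nothing on every
 * architecture and exists only to document intent for reviewers.
 */
#define wmb_before_mmio() do { } while (0)

/* Userspace stand-in for writel(), assumed to order prior memory writes. */
static void model_writel(uint32_t val, volatile uint32_t *reg)
{
	atomic_thread_fence(memory_order_release);
	*reg = val;
}

/* Fill one ring slot in memory, then tell the "device" which slot is new. */
static void queue_and_kick(uint32_t *ring, unsigned int slot, uint32_t desc,
			   volatile uint32_t *doorbell)
{
	assert(ring && doorbell);
	ring[slot] = desc;
	wmb_before_mmio();		/* generates no code */
	model_writel(slot, doorbell);	/* ordering comes from here */
}
```

The design point is that the annotation carries no cost and no semantics; all ordering still comes from the accessor itself, which is also why the replies below question whether such a macro is worth adding.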
* Re: RFC on writel and writel_relaxed 2018-03-27 9:44 ` Arnd Bergmann @ 2018-03-27 10:00 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 10:00 UTC (permalink / raw) To: Arnd Bergmann Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 11:44:22AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 10:56 AM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > >> On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > >> > >> I'm pretty sure I've never seen > >> any bug reports pointing to a missing wmb() between memory > >> and MMIO write accesses, but if you remember seeing them in the > >> list, maybe you can look again for some evidence of something going > >> wrong on x86 without it? > > > > The interesting thing is that we do seem to have a whole LOT of these > > spurious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Maybe the problem is just that it's so counter-intuitive that we don't > need that barrier in Linux, when the hardware does need one on some > architectures. > > How about we define a barrier type instruction specifically for this > purpose, something like wmb_before_mmio() and have all architectures > define that to an empty macro? > > That way, having correct code using wmb_before_mmio() will not > trigger an incorrect review comment that leads to extra wmb(). ;-) Please don't add more barriers :)! 
I think that will make it even more difficult to understand how to use the ones we already have -- the problem here seems to be that the documentation that was added for the dma_* barriers got this wrong, but it was at least in contradiction with the section elsewhere in memory-barriers.txt that describes the relaxed I/O accessors. I guess somebody could hack checkpatch to look for back-to-back wmb/writel sequences? I suspect we could do something with coccinelle too. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
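The suggestion above to look for back-to-back wmb/writel sequences mechanically can be illustrated with a small scanner. This is neither checkpatch nor coccinelle; it is a crude, line-based C sketch with hypothetical helper names, shown only to make the pattern concrete. It counts each `wmb();` whose next non-blank, non-comment line calls `writel(`, and it deliberately ignores `dma_wmb()`, `smp_wmb()` and `writel_relaxed()`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int is_ident_char(char c)
{
	return c == '_' || (c >= '0' && c <= '9') ||
	       (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

/*
 * Nonzero if the line is blank or looks like a comment line. Crude: any
 * line starting with '*' is treated as a comment continuation, so a
 * statement such as "*p = x;" is skipped as well.
 */
static int skippable(const char *s, size_t len)
{
	size_t i = 0;

	while (i < len && (s[i] == ' ' || s[i] == '\t' || s[i] == '\r'))
		i++;
	if (i == len)
		return 1;
	if (s[i] == '*')
		return 1;
	if (i + 1 < len && s[i] == '/' && (s[i + 1] == '*' || s[i + 1] == '/'))
		return 1;
	return 0;
}

/* Nonzero if needle occurs in the line and is not the tail of a longer name. */
static int line_has_call(const char *s, size_t len, const char *needle)
{
	size_t n = strlen(needle);

	for (size_t i = 0; i + n <= len; i++)
		if (memcmp(s + i, needle, n) == 0 &&
		    (i == 0 || !is_ident_char(s[i - 1])))
			return 1;
	return 0;
}

/* Count wmb(); lines whose next code line contains a writel( call. */
static int count_wmb_then_writel(const char *src)
{
	int hits = 0, pending = 0;

	assert(src);
	while (*src) {
		const char *eol = strchr(src, '\n');
		size_t len = eol ? (size_t)(eol - src) : strlen(src);

		if (!skippable(src, len)) {
			if (pending && line_has_call(src, len, "writel("))
				hits++;
			pending = line_has_call(src, len, "wmb();");
		}
		src += len + (eol ? 1 : 0);
	}
	return hits;
}
```

A hit is only a candidate for review: as the thread notes, a wmb() between two writes to a coherent DMA area is still required, and the write-combining case raised later in the thread is a separate question, so a real tool would need more context than adjacent lines.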
* Re: RFC on writel and writel_relaxed 2018-03-27 9:44 ` Arnd Bergmann @ 2018-03-27 11:23 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 11:23 UTC (permalink / raw) To: Arnd Bergmann Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: > > The interesting thing is that we do seem to have a whole LOT of these > > spurious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Maybe the problem is just that it's so counter-intuitive that we don't > need that barrier in Linux, when the hardware does need one on some > architectures. > > How about we define a barrier type instruction specifically for this > purpose, something like wmb_before_mmio() and have all architectures > define that to an empty macro? This is exactly what wmb() is about and exactly what Linus rejected back in the day (and in hindsight I agree with him). > That way, having correct code using wmb_before_mmio() will not > trigger an incorrect review comment that leads to extra wmb(). ;-) Ah, you mean have an empty macro that will always be empty on all architectures just to fool people? :-) Not sure that will fly ... I think we just need to document that stuff better and not have incorrect examples. Also, a sweep to remove some useless ones like the one in e1000e would help. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 11:23 ` Benjamin Herrenschmidt @ 2018-03-27 12:22 ` okaya -1 siblings, 0 replies; 216+ messages in thread From: okaya @ 2018-03-27 12:22 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On 2018-03-27 07:23, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: >> > The interesting thing is that we do seem to have a whole LOT of these >> > spurious wmb before writel all over the tree, I suspect because of >> > that incorrect recommendation in memory-barriers.txt. >> > >> > We should fix that. >> >> Maybe the problem is just that it's so counter-intuitive that we don't >> need that barrier in Linux, when the hardware does need one on some >> architectures. >> >> How about we define a barrier type instruction specifically for this >> purpose, something like wmb_before_mmio() and have all architectures >> define that to an empty macro? > > This is exactly what wmb() is about and exactly what Linus rejected > back in the day (and in hindsight I agree with him). > >> That way, having correct code using wmb_before_mmio() will not >> trigger an incorrect review comment that leads to extra wmb(). ;-) > > Ah, you mean have an empty macro that will always be empty on all > architectures just to fool people ? :-) > > Not sure that will fly ... I think we just need to be documenting that > stuff better and not have incorrect examples. Also a sweep to remove > some useless ones like the one in e1000e would help. I have been converting wmb+writel to wmb+writel_relaxed (about 30 patches). I will have to just remove the wmb and keep writel, then repost. Some of these have already been applied, so it will cause some churn for the maintainers. > > Cheers, > Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 12:22 ` okaya @ 2018-03-27 14:12 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-27 14:12 UTC (permalink / raw) To: okaya Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 08:22:55AM -0400, okaya@codeaurora.org wrote: > On 2018-03-27 07:23, Benjamin Herrenschmidt wrote: > >On Tue, 2018-03-27 at 11:44 +0200, Arnd Bergmann wrote: > >>> The interesting thing is that we do seem to have a whole LOT of these > >>> spurious wmb before writel all over the tree, I suspect because of > >>> that incorrect recommendation in memory-barriers.txt. > >>> > >>> We should fix that. > >> > >>Maybe the problem is just that it's so counter-intuitive that we don't > >>need that barrier in Linux, when the hardware does need one on some > >>architectures. > >> > >>How about we define a barrier type instruction specifically for this > >>purpose, something like wmb_before_mmio() and have all architectures > >>define that to an empty macro? > > > >This is exactly what wmb() is about and exactly what Linus rejected > >back in the day (and in hindsight I agree with him). > > > >>That way, having correct code using wmb_before_mmio() will not > >>trigger an incorrect review comment that leads to extra wmb(). ;-) > > > >Ah, you mean have an empty macro that will always be empty on all > >architectures just to fool people ? :-) > > > >Not sure that will fly ... I think we just need to be documenting that > >stuff better and not have incorrect examples. Also a sweep to remove > >some useless ones like the one in e1000e would help. > > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches) > > I will have to just remove the wmb and keep writel, then repost. Okay, but before you do that, can we get a statement on how this works for WC? 
Some of these writels are to WC memory, do they need the wmb()?!? Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 14:12 ` Jason Gunthorpe @ 2018-03-27 21:27 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 21:27 UTC (permalink / raw) To: Jason Gunthorpe, okaya Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Will Deacon, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, 2018-03-27 at 08:12 -0600, Jason Gunthorpe wrote: > > I have been converting wmb+writel to wmb+writel_relaxed. (About 30 patches) > > > > I will have to just remove the wmb and keep writel, then repost. > > Okay, but before you do that, can we get a statement how this works > for WC? > > Some of these writels are to WC memory, do they need the wmb()?!? This is an issue as we don't have well defined semantics for WC. At this point, I would suggest staying away from that (ie, not changing them). We need to look into it. I know for example that on powerpc I cannot give you any weaker semantic on WC for writel (I have to put a full sync in there), but I am trying to see if I can make writel_relaxed both work with the existing semantics and provide combining. But it's not yet a given (our weaker IO barrier, eieio, isn't architecturally defined to do anything on G=0 space, looking with the HW guys at what the HW actually does). Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 8:56 ` Benjamin Herrenschmidt @ 2018-03-27 9:57 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 9:57 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, peterz, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), mingo [+ locking/ordering/docs people] On Tue, Mar 27, 2018 at 07:56:59PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 09:56 +0200, Arnd Bergmann wrote: > > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > > > On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > > > I even see patches adding wmb() based on actual observed memory > > > corruption during testing on Intel: > > > > > > https://patchwork.kernel.org/patch/10177207/ > > > > > > So you think all of this is unnecessary and writel is totally strongly > > > ordered, even on multi-socket Intel? > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. > > Ah you are right, I incorrectly assumed that the "prod_db" function was > an MMIO. So we do NOT have a counter example where wmb is needed on > x86, pfiew ! :-) > > > I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > The interesting thing is that we do seem to have a whole LOT of these > spurrious wmb before writel all over the tree, I suspect because of > that incorrect recommendation in memory-barriers.txt. > > We should fix that. Patch below. Thoughts? 
Will

--->8

From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Tue, 27 Mar 2018 10:49:58 +0100
Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering example

The section of memory-barriers.txt that describes the dma_Xmb() barriers
has an incorrect example claiming that a wmb() is required after writing
to coherent memory in order for those writes to be visible to a device
before a subsequent MMIO access using writel() can reach the device.

In fact, this ordering guarantee is provided (at significant cost on some
architectures such as arm and power) by writel, so the wmb() is not
necessary. writel_relaxed exists for cases where this ordering is not
required.

Fix the example and update the text to make this clearer.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Reported-by: Sinan Kaya <okaya@codeaurora.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/memory-barriers.txt | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..2556b4b0e6f9 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,16 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership. The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership. Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the cache incoherent MMIO region.
+     If this ordering between incoherent MMIO and coherent memory regions
+     is not required, writel_relaxed() can be used instead and is significantly
+     cheaper on some weakly-ordered architectures.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER
-- 
2.1.4

^ permalink raw reply related [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 9:57 ` Will Deacon @ 2018-03-27 10:05 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 10:05 UTC (permalink / raw) To: Will Deacon Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote: > > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001 > From: Will Deacon <will.deacon@arm.com> > Date: Tue, 27 Mar 2018 10:49:58 +0100 > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering > example > > The section of memory-barriers.txt that describes the dma_Xmb() barriers > has an incorrect example claiming that a wmb() is required after writing > to coherent memory in order for those writes to be visible to a device > before a subsequent MMIO access using writel() can reach the device. > > In fact, this ordering guarantee is provided (at significant cost on some > architectures such as arm and power) by writel, so the wmb() is not > necessary. writel_relaxed exists for cases where this ordering is not > required. > > Fix the example and update the text to make this clearer. > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Cc: Arnd Bergmann <arnd@arndb.de> > Cc: Jason Gunthorpe <jgg@ziepe.ca> > Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> > Cc: Peter Zijlstra <peterz@infradead.org> > Cc: Ingo Molnar <mingo@redhat.com> > Cc: Jonathan Corbet <corbet@lwn.net> > Reported-by: Sinan Kaya <okaya@codeaurora.org> > Signed-off-by: Will Deacon <will.deacon@arm.com> > --- > Documentation/memory-barriers.txt | 18 ++++++++++-------- > 1 file changed, 10 insertions(+), 8 deletions(-) > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > index a863009849a3..2556b4b0e6f9 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the cache incoherent MMIO region. > + If this ordering between incoherent MMIO and coherent memory regions > + is not required, writel_relaxed() can be used instead and is significantly > + cheaper on some weakly-ordered architectures. 
I think that's a great improvement, but I'm a bit worried about recommending writel_relaxed() too much: I've seen a lot of drivers that just always use writel_relaxed() over writel(), and some of them get that wrong when they don't understand the difference but end up using DMA without explicit barriers anyway. Also, having an architecture-independent driver use wmb()+writel_relaxed() ends up being more expensive than just using writel(). Not sure how to best phrase it though. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 10:05 ` Arnd Bergmann @ 2018-03-27 10:09 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 10:09 UTC (permalink / raw) To: Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 11:57 AM, Will Deacon <will.deacon@arm.com> wrote: > > > > > From db0daeaf94f0f6232f8206fc07a74211324b11d9 Mon Sep 17 00:00:00 2001 > > From: Will Deacon <will.deacon@arm.com> > > Date: Tue, 27 Mar 2018 10:49:58 +0100 > > Subject: [PATCH] docs/memory-barriers.txt: Fix broken DMA vs MMIO ordering > > example > > > > The section of memory-barriers.txt that describes the dma_Xmb() barriers > > has an incorrect example claiming that a wmb() is required after writing > > to coherent memory in order for those writes to be visible to a device > > before a subsequent MMIO access using writel() can reach the device. > > > > In fact, this ordering guarantee is provided (at significant cost on some > > architectures such as arm and power) by writel, so the wmb() is not > > necessary. writel_relaxed exists for cases where this ordering is not > > required. > > > > Fix the example and update the text to make this clearer. > > > > Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org> > > Cc: Arnd Bergmann <arnd@arndb.de> > > Cc: Jason Gunthorpe <jgg@ziepe.ca> > > Cc: "Paul E. 
McKenney" <paulmck@linux.vnet.ibm.com> > > Cc: Peter Zijlstra <peterz@infradead.org> > > Cc: Ingo Molnar <mingo@redhat.com> > > Cc: Jonathan Corbet <corbet@lwn.net> > > Reported-by: Sinan Kaya <okaya@codeaurora.org> > > Signed-off-by: Will Deacon <will.deacon@arm.com> > > --- > > Documentation/memory-barriers.txt | 18 ++++++++++-------- > > 1 file changed, 10 insertions(+), 8 deletions(-) > > > > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > > index a863009849a3..2556b4b0e6f9 100644 > > --- a/Documentation/memory-barriers.txt > > +++ b/Documentation/memory-barriers.txt > > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > > /* assign ownership */ > > desc->status = DEVICE_OWN; > > > > - /* force memory to sync before notifying device via MMIO */ > > - wmb(); > > - > > /* notify device of new descriptors */ > > writel(DESC_NOTIFY, doorbell); > > } > > @@ -1919,11 +1916,16 @@ There are some more advanced barrier functions: > > The dma_rmb() allows us guarantee the device has released ownership > > before we read the data from the descriptor, and the dma_wmb() allows > > us to guarantee the data is written to the descriptor before the device > > - can see it now has ownership. The wmb() is needed to guarantee that the > > - cache coherent memory writes have completed before attempting a write to > > - the cache incoherent MMIO region. > > - > > - See Documentation/DMA-API.txt for more information on consistent memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the cache incoherent MMIO region. > > + If this ordering between incoherent MMIO and coherent memory regions > > + is not required, writel_relaxed() can be used instead and is significantly > > + cheaper on some weakly-ordered architectures. 
> > I think that's a great improvement, but I'm a bit worried about recommending > writel_relaxed() too much: I've seen a lot of drivers that just always use > writel_relaxed() over write(), and some of them get that wrong when they > don't understand the difference but end up using DMA without explicit > barriers anyway. > > Also, having an architecture-independent driver use wmb()+writel_relaxed() > ends up being more expensive than just using write(). Not sure how to > best phrase it though. Perhaps I could reword that with a simple example to say: If this ordering between incoherent MMIO and coherent memory regions is not required (e.g. in a sequence of accesses all to the MMIO region) [...] since that seems to be the usual case where the _relaxed accessors help. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 10:09 ` Will Deacon @ 2018-03-27 10:53 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 10:53 UTC (permalink / raw) To: Will Deacon Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote: > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: >> > - >> > - See Documentation/DMA-API.txt for more information on consistent memory. >> > + can see it now has ownership. Note that, when using writel(), a prior >> > + wmb() is not needed to guarantee that the cache coherent memory writes >> > + have completed before writing to the cache incoherent MMIO region. >> > + If this ordering between incoherent MMIO and coherent memory regions One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd prefer just "MMIO" here. At least I don't have the faintest clue what the difference between "coherent MMIO" and "incoherent MMIO" would be ;-) >> > + is not required, writel_relaxed() can be used instead and is significantly >> > + cheaper on some weakly-ordered architectures. >> >> I think that's a great improvement, but I'm a bit worried about recommending >> writel_relaxed() too much: I've seen a lot of drivers that just always use >> writel_relaxed() over write(), and some of them get that wrong when they >> don't understand the difference but end up using DMA without explicit >> barriers anyway. >> >> Also, having an architecture-independent driver use wmb()+writel_relaxed() >> ends up being more expensive than just using write(). Not sure how to >> best phrase it though. > > Perhaps I add reword that with a simple example to say: > > If this ordering between incoherent MMIO and coherent memory regions > is not required (e.g. 
in a sequence of accesses all to the MMIO region) > [...] > > since that seems to be the usual case where the _relaxed accessors help. That still doesn't quite capture what I'd like driver writers to do: in essence I would recommend that they use writel() all the time, except in performance-critical code that has been shown to be correct and has a comment to explain why _relaxed() is ok in that particular function. Maybe it can just be rephrased to warn against the use of writel_relaxed() here, and explain the difference that way: can see it now has ownership. Note that, when using writel(), a prior wmb() is not needed to guarantee that the cache coherent memory writes have completed before writing to the cache incoherent MMIO region. The cheaper writel_relaxed() does not guarantee the DMA to be visible to the device and must not be used here. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
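As an illustration of the "sequence of accesses all to the MMIO region" case mentioned above, here is a minimal userspace sketch, not kernel code. writel_relaxed() below is a hypothetical stand-in that only stores to a volatile location so the snippet compiles outside the kernel; the register layout (REG_PERIOD, REG_DUTY, REG_ENABLE), TIMER_ENABLE and configure_timer() are invented for the example. It also tries to follow the suggestion that any use of a _relaxed accessor carry a comment saying why it is safe in that function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indices for an imaginary timer block. */
enum { REG_PERIOD, REG_DUTY, REG_ENABLE, REG_COUNT };
#define TIMER_ENABLE 1u

/*
 * Stand-in for the kernel accessor (assumption: userspace model only).
 * The real writel_relaxed() is architecture-specific.
 */
static inline void writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
    *addr = val;
}

/*
 * _relaxed is assumed to be acceptable here because every access goes to
 * the same device's MMIO region and no memory buffer is handed to the
 * device, so no ordering against normal memory writes is needed.
 */
static void configure_timer(volatile uint32_t *regs, uint32_t period,
                            uint32_t duty)
{
    assert(regs);
    writel_relaxed(period, &regs[REG_PERIOD]);
    writel_relaxed(duty, &regs[REG_DUTY]);
    writel_relaxed(TIMER_ENABLE, &regs[REG_ENABLE]);
}
```

If the same function later had to publish a buffer or descriptor to the device, the final store would, per the wording proposed in this thread, need to be a plain writel() instead.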
* Re: RFC on writel and writel_relaxed 2018-03-27 10:53 ` Arnd Bergmann @ 2018-03-27 11:02 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 11:02 UTC (permalink / raw) To: Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote: > > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > >> > - > >> > - See Documentation/DMA-API.txt for more information on consistent memory. > >> > + can see it now has ownership. Note that, when using writel(), a prior > >> > + wmb() is not needed to guarantee that the cache coherent memory writes > >> > + have completed before writing to the cache incoherent MMIO region. > >> > + If this ordering between incoherent MMIO and coherent memory regions > > One more thing: I think the term "incoherent MMIO" is a bit confusing, I'd > prefer just "MMIO" here. At least I don't have the faintest clue what the > difference between "coherent MMIO" and "incoherent MMIO" would be ;-) Yes, you're right. I was just following the terminology that's already used here, but actually that seems not to be used anywhere else in the document! I'll kill it. > >> > + is not required, writel_relaxed() can be used instead and is significantly > >> > + cheaper on some weakly-ordered architectures. > >> > >> I think that's a great improvement, but I'm a bit worried about recommending > >> writel_relaxed() too much: I've seen a lot of drivers that just always use > >> writel_relaxed() over write(), and some of them get that wrong when they > >> don't understand the difference but end up using DMA without explicit > >> barriers anyway.
> >> > >> Also, having an architecture-independent driver use wmb()+writel_relaxed() > >> ends up being more expensive than just using write(). Not sure how to > >> best phrase it though. > > > > Perhaps I add reword that with a simple example to say: > > > > If this ordering between incoherent MMIO and coherent memory regions > > is not required (e.g. in a sequence of accesses all to the MMIO region) > > [...] > > > > since that seems to be the usual case where the _relaxed accessors help. > > That still doesn't quite capture what I'd like driver writes to do: in essence > I would recommend them to use writel() all the time, except in performance > critical code that has been shown to be correct and has a comment to explain > why _relaxed() is ok in that particular function. > > Maybe it can just be rephrased to warn against the use of writel_relaxed() > here, and explain the difference that way: > > can see it now has ownership. Note that, when using writel(), a prior > wmb() is not needed to guarantee that the cache coherent memory writes > have completed before writing to the cache incoherent MMIO region. > The cheaper writel_relaxed() does not guarantee the DMA to be visible > to the device and must not be used here. Fair enough. I'd rather people used _relaxed by default, but I have to admit that it will probably just result in them getting things wrong. 
Just a tiny bit of wordsmithing brings this to:

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index a863009849a3..3247547d1c36 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -1909,9 +1909,6 @@ There are some more advanced barrier functions:
 		/* assign ownership */
 		desc->status = DEVICE_OWN;
 
-		/* force memory to sync before notifying device via MMIO */
-		wmb();
-
 		/* notify device of new descriptors */
 		writel(DESC_NOTIFY, doorbell);
 	}
@@ -1919,11 +1916,15 @@ There are some more advanced barrier functions:
      The dma_rmb() allows us guarantee the device has released ownership
      before we read the data from the descriptor, and the dma_wmb() allows
      us to guarantee the data is written to the descriptor before the device
-     can see it now has ownership. The wmb() is needed to guarantee that the
-     cache coherent memory writes have completed before attempting a write to
-     the cache incoherent MMIO region.
-
-     See Documentation/DMA-API.txt for more information on consistent memory.
+     can see it now has ownership. Note that, when using writel(), a prior
+     wmb() is not needed to guarantee that the cache coherent memory writes
+     have completed before writing to the MMIO region. The cheaper
+     writel_relaxed() does not provide this guarantee and must not be used
+     here.
+
+     See the subsection "Kernel I/O barrier effects" for more information on
+     relaxed I/O accessors and the Documentation/DMA-API.txt file for more
+     information on consistent memory.
 
 
 MMIO WRITE BARRIER

If you're happy with that, I'll send it as a proper patch.

Cheers,

Will

^ permalink raw reply related [flat|nested] 216+ messages in thread
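For readers who want to step through the descriptor hand-off that the patch context refers to, the following is a minimal userspace sketch, not kernel code. dma_rmb(), dma_wmb() and writel() are hypothetical stand-ins built on C11 fences so the sequence compiles anywhere; the real kernel primitives are architecture-specific and their costs differ. struct desc, the values of DEVICE_OWN and DESC_NOTIFY, and post_descriptor() are assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEVICE_OWN  1u  /* hypothetical value */
#define DESC_NOTIFY 1u  /* hypothetical value */

struct desc {
    uint32_t data;
    uint32_t status;
};

/* Stand-ins for the kernel barriers (assumption: userspace model only). */
static inline void dma_rmb(void) { atomic_thread_fence(memory_order_acquire); }
static inline void dma_wmb(void) { atomic_thread_fence(memory_order_release); }

/*
 * Stand-in for writel(): the full fence models the guarantee discussed in
 * the thread, namely that earlier writes to coherent memory are ordered
 * before the MMIO write, so no separate wmb() precedes the doorbell.
 */
static inline void writel(uint32_t val, volatile uint32_t *addr)
{
    atomic_thread_fence(memory_order_seq_cst);
    *addr = val;
}

/* Returns 0 on success, -1 if the device still owns the descriptor. */
static int post_descriptor(struct desc *desc, volatile uint32_t *doorbell,
                           uint32_t write_data, uint32_t *read_data)
{
    assert(desc && doorbell && read_data);

    if (desc->status == DEVICE_OWN)
        return -1;

    /* do not read data until we own the descriptor */
    dma_rmb();

    /* read/modify data */
    *read_data = desc->data;
    desc->data = write_data;

    /* flush modifications before the status update */
    dma_wmb();

    /* assign ownership */
    desc->status = DEVICE_OWN;

    /* notify device of new descriptors; no wmb() needed before writel() */
    writel(DESC_NOTIFY, doorbell);
    return 0;
}
```

Swapping the stand-in writel() for a relaxed store without the fence would model exactly the case the new wording warns against: the device could observe the doorbell before the ownership update.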
* Re: RFC on writel and writel_relaxed 2018-03-27 11:02 ` Will Deacon @ 2018-03-27 11:05 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 11:05 UTC (permalink / raw) To: Will Deacon Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 1:02 PM, Will Deacon <will.deacon@arm.com> wrote: > On Tue, Mar 27, 2018 at 12:53:49PM +0200, Arnd Bergmann wrote: >> On Tue, Mar 27, 2018 at 12:09 PM, Will Deacon <will.deacon@arm.com> wrote: >> > On Tue, Mar 27, 2018 at 12:05:06PM +0200, Arnd Bergmann wrote: > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > index a863009849a3..3247547d1c36 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. 
The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. > + > + See the subsection "Kernel I/O barrier effects" for more information on > + relaxed I/O accessors and the Documentation/DMA-API.txt file for more > + information on consistent memory. > > > MMIO WRITE BARRIER > > > If you're happy with that, I'll send it as a proper patch. Looks good to me, thanks! Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 11:02 ` Will Deacon @ 2018-03-27 11:25 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 11:25 UTC (permalink / raw) To: Will Deacon, Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, 2018-03-27 at 12:02 +0100, Will Deacon wrote: > can see it now has ownership. Note that, when using writel(), a prior > > wmb() is not needed to guarantee that the cache coherent memory writes > > have completed before writing to the cache incoherent MMIO region. > > The cheaper writel_relaxed() does not guarantee the DMA to be visible > > to the device and must not be used here. > > Fair enough. I'd rather people used _relaxed by default, but I have to admit > that it will probably just result in them getting things wrong. Just a tiny > bit of wordsmithing brings this to: I prefer people using writel() by default for the simple reason that 99% of writels out there are configuration stuff for which the performance difference doesn't matter, and people will just get it wrong. Let's focus on the rare fast path for optimisation. 
> > diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt > index a863009849a3..3247547d1c36 100644 > --- a/Documentation/memory-barriers.txt > +++ b/Documentation/memory-barriers.txt > @@ -1909,9 +1909,6 @@ There are some more advanced barrier functions: > /* assign ownership */ > desc->status = DEVICE_OWN; > > - /* force memory to sync before notifying device via MMIO */ > - wmb(); > - > /* notify device of new descriptors */ > writel(DESC_NOTIFY, doorbell); > } > @@ -1919,11 +1916,15 @@ There are some more advanced barrier functions: > The dma_rmb() allows us guarantee the device has released ownership > before we read the data from the descriptor, and the dma_wmb() allows > us to guarantee the data is written to the descriptor before the device > - can see it now has ownership. The wmb() is needed to guarantee that the > - cache coherent memory writes have completed before attempting a write to > - the cache incoherent MMIO region. > - > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. > + > + See the subsection "Kernel I/O barrier effects" for more information on > + relaxed I/O accessors and the Documentation/DMA-API.txt file for more > + information on consistent memory. > > > MMIO WRITE BARRIER > > > If you're happy with that, I'll send it as a proper patch. > > Cheers, > > Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* RE: RFC on writel and writel_relaxed 2018-03-27 11:02 ` Will Deacon @ 2018-03-27 13:20 ` David Laight -1 siblings, 0 replies; 216+ messages in thread From: David Laight @ 2018-03-27 13:20 UTC (permalink / raw) To: 'Will Deacon', Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)

> Fair enough. I'd rather people used _relaxed by default, but I have to admit
> that it will probably just result in them getting things wrong...

Certainly requiring driver writers to use explicit barriers should make them understand when and why they are needed - and then put in the correct ones.

The problem I've had is that I have a good idea which barriers are needed but find that readl/writel seem to contain a lot of extra ones. Maybe they are required in some places, but the extra synchronising instructions could easily have measurable performance effects on hot paths.

Drivers are likely to contain sequences like:

	read_io
	if (...) return
	write_mem
	...
	write_mem
	barrier
	write_mem
	barrier
	write_io

for things like ring updates. Where the 'mem' might actually be in io space. In such sequences not all the synchronising instructions are needed. I'm not at all sure it is easy to get the right set.

David

^ permalink raw reply [flat|nested] 216+ messages in thread
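The ring-update sequence above can be sketched roughly as follows, again as a userspace model rather than kernel code. readl_relaxed(), writel_relaxed(), dma_wmb() and wmb() are hypothetical stand-ins (plain volatile accesses and C11 fences); struct ring, RING_SIZE and ring_push() are invented for the example, and the sketch assumes 'mem' is ordinary coherent memory rather than io space. Each step is labelled with the corresponding line of the sequence.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4u  /* hypothetical ring depth */

struct ring {
    uint32_t slot[RING_SIZE];  /* "mem": payload the device will read */
    uint32_t prod;             /* "mem": producer index the device polls */
};

/* Stand-ins for the kernel primitives (assumption: userspace model only). */
static inline uint32_t readl_relaxed(const volatile uint32_t *addr)
{
    return *addr;
}

static inline void writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
    *addr = val;
}

static inline void dma_wmb(void) { atomic_thread_fence(memory_order_release); }
static inline void wmb(void)     { atomic_thread_fence(memory_order_seq_cst); }

/* Returns 0 on success, -1 if the ring is full. */
static int ring_push(struct ring *r, volatile uint32_t *cons_reg,
                     volatile uint32_t *doorbell, uint32_t payload)
{
    uint32_t cons;

    assert(r && cons_reg && doorbell);

    cons = readl_relaxed(cons_reg);            /* read_io */
    if (r->prod - cons >= RING_SIZE)           /* if (...) return */
        return -1;

    r->slot[r->prod % RING_SIZE] = payload;    /* write_mem ... write_mem */
    dma_wmb();                                 /* barrier: payload before index */
    r->prod++;                                 /* write_mem */
    wmb();                                     /* barrier: memory before MMIO */
    writel_relaxed(r->prod, doorbell);         /* write_io */
    return 0;
}
```

In this model the two barriers are the only ordering points. Under the wording proposed earlier in the thread, replacing the last two steps with a single writel() would give the same memory-before-MMIO ordering; the sketch takes no position on whether that is cheaper or dearer on a given architecture.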
* Re: RFC on writel and writel_relaxed 2018-03-27 11:02 ` Will Deacon @ 2018-03-27 13:46 ` Sinan Kaya -1 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-27 13:46 UTC (permalink / raw) To: Will Deacon, Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On 3/27/2018 7:02 AM, Will Deacon wrote: > - See Documentation/DMA-API.txt for more information on consistent memory. > + can see it now has ownership. Note that, when using writel(), a prior > + wmb() is not needed to guarantee that the cache coherent memory writes > + have completed before writing to the MMIO region. The cheaper > + writel_relaxed() does not provide this guarantee and must not be used > + here. Can we say the same thing for iowrite32() and iowrite32be()? I also see wmb() in front of these. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 13:46 ` Sinan Kaya @ 2018-03-27 14:36 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 14:36 UTC (permalink / raw) To: Sinan Kaya Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, Mar 27, 2018 at 09:46:51AM -0400, Sinan Kaya wrote: > On 3/27/2018 7:02 AM, Will Deacon wrote: > > - See Documentation/DMA-API.txt for more information on consistent memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the MMIO region. The cheaper > > + writel_relaxed() does not provide this guarantee and must not be used > > + here. > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > in front of these. I don't think so. My reading of memory-barriers.txt says that writeX might expand to outX, and outX is not ordered with respect to other types of memory. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 14:36 ` Will Deacon @ 2018-03-27 21:29 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 21:29 UTC (permalink / raw) To: Will Deacon, Sinan Kaya Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote: > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > > in front of these. > > I don't think so. My reading of memory-barriers.txt says that writeX might > expand to outX, and outX is not ordered with respect to other types of > memory. Ugh ? My understanding of HW at least is the exact opposite. outX is *more* ordered if anything, than any other accessors. IO space is completely synchronous, non posted and ordered afaik. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 21:29 ` Benjamin Herrenschmidt @ 2018-03-28 8:53 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-28 8:53 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, Mar 28, 2018 at 08:29:45AM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 15:36 +0100, Will Deacon wrote: > > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > > > in front of these. > > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > expand to outX, and outX is not ordered with respect to other types of > > memory. > > Ugh ? > > My understanding of HW at least is the exact opposite. outX is *more* > ordered if anything, than any other accessors. IO space is completely > synchronous, non posted and ordered afaik. I'm just going by memory-barriers.txt: (*) inX(), outX(): [...] They are guaranteed to be fully ordered with respect to each other. They are not guaranteed to be fully ordered with respect to other types of memory and I/O operation. For arm/arm64 these end up behaving exactly the same as readX/writeX, but I'm nervous about changing the documentation without understanding why it's like it is currently. Maybe another ia64 thing?. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* RE: RFC on writel and writel_relaxed 2018-03-28 8:53 ` Will Deacon @ 2018-03-28 9:00 ` David Laight -1 siblings, 0 replies; 216+ messages in thread From: David Laight @ 2018-03-28 9:00 UTC (permalink / raw) To: 'Will Deacon', Benjamin Herrenschmidt Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) From: Will Deacon > Sent: 28 March 2018 09:54 ... > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > > expand to outX, and outX is not ordered with respect to other types of > > > memory. > > > > Ugh ? > > > > My understanding of HW at least is the exact opposite. outX is *more* > > ordered if anything, than any other accessors. IO space is completely > > synchronous, non posted and ordered afaik. > > I'm just going by memory-barriers.txt: > > > (*) inX(), outX(): > > [...] > > They are guaranteed to be fully ordered with respect to each other. > > They are not guaranteed to be fully ordered with respect to other types of > memory and I/O operation. A long time ago there was a document from Intel that said that inb/outb weren't necessarily synchronised wrt memory accesses. (Might be P-pro era). However no processors actually behaved that way and more recent docs say that inb/outb are fully ordered. David ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 9:00 ` David Laight @ 2018-03-28 9:09 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-28 9:09 UTC (permalink / raw) To: David Laight Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote: > From: Will Deacon > > Sent: 28 March 2018 09:54 > ... > > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > > > expand to outX, and outX is not ordered with respect to other types of > > > > memory. > > > > > > Ugh ? > > > > > > My understanding of HW at least is the exact opposite. outX is *more* > > > ordered if anything, than any other accessors. IO space is completely > > > synchronous, non posted and ordered afaik. > > > > I'm just going by memory-barriers.txt: > > > > > > (*) inX(), outX(): > > > > [...] > > > > They are guaranteed to be fully ordered with respect to each other. > > > > They are not guaranteed to be fully ordered with respect to other types of > > memory and I/O operation. > > A long time ago there was a document from Intel that said that inb/outb weren't > necessarily synchronised wrt memory accesses. > (Might be P-pro era). > However no processors actually behaved that way and more recent docs > say that inb/outb are fully ordered. Thank you, David! I'll write another patch fixing this up and hopefully we'll soon have one making writeX/readX much clearer. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 9:09 ` Will Deacon @ 2018-03-28 9:56 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 9:56 UTC (permalink / raw) To: Will Deacon, David Laight Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Wed, 2018-03-28 at 10:09 +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:00:01AM +0000, David Laight wrote: > > From: Will Deacon > > > Sent: 28 March 2018 09:54 > > > > ... > > > > > I don't think so. My reading of memory-barriers.txt says that writeX might > > > > > expand to outX, and outX is not ordered with respect to other types of > > > > > memory. > > > > > > > > Ugh ? > > > > > > > > My understanding of HW at least is the exact opposite. outX is *more* > > > > ordered if anything, than any other accessors. IO space is completely > > > > synchronous, non posted and ordered afaik. > > > > > > I'm just going by memory-barriers.txt: > > > > > > > > > (*) inX(), outX(): > > > > > > [...] > > > > > > They are guaranteed to be fully ordered with respect to each other. > > > > > > They are not guaranteed to be fully ordered with respect to other types of > > > memory and I/O operation. > > > > A long time ago there was a document from Intel that said that inb/outb weren't > > necessarily synchronised wrt memory accesses. > > (Might be P-pro era). > > However no processors actually behaved that way and more recent docs > > say that inb/outb are fully ordered. > > Thank you, David! I'll write another patch fixing this up and hopefully > we'll soon have one making writeX/readX much clearer. Thanks for doing the grunt work Will ! :-) Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 8:53 ` Will Deacon @ 2018-03-28 9:50 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 9:50 UTC (permalink / raw) To: Will Deacon Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote: > For arm/arm64 these end up behaving exactly the same as readX/writeX, but > I'm nervous about changing the documentation without understanding why it's > like it is currently. Maybe another ia64 thing?. I doubt it ... the Intel ancestry here would make me think they are completely ordered there too. powerpc and ARM can't quite make them synchronous I think, but at least they should have the same semantics as writel. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 9:50 ` Benjamin Herrenschmidt @ 2018-03-28 9:55 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-28 9:55 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Jonathan Corbet, linux-rdma, Will Deacon, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, Mar 28, 2018 at 11:50 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2018-03-28 at 09:53 +0100, Will Deacon wrote: >> For arm/arm64 these end up behaving exactly the same as readX/writeX, but >> I'm nervous about changing the documentation without understanding why it's >> like it is currently. Maybe another ia64 thing?. > > I doubt it ... the Intel ancestry here would make me think they are > completely ordered there too. > > powerpc and ARM can't quite make them synchronous I think, but at least > they should have the same semantics as writel. One thing that ARM does IIRC is that it only guarantees to order writel() within one device, and the memory mapped PCI I/O space window almost certainly counts as a separate device to the CPU. In the absence of an enforced global synchronization during an I/O port access, that means writel() and outb() can be reordered before they arrive at a device in theory. Again, this rarely matters in practice, but I think it makes sense to document the less strict behavior here, given that we have common hardware that can't provide x86 compatible semantics. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 9:55 ` Arnd Bergmann @ 2018-03-28 10:01 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 10:01 UTC (permalink / raw) To: Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Will Deacon, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > powerpc and ARM can't quite make them synchronous I think, but at least > > they should have the same semantics as writel. > > One thing that ARM does IIRC is that it only guarantees to order writel() within > one device, and the memory mapped PCI I/O space window almost certainly > counts as a separate device to the CPU. That sounds bogus. > In the absence of an enforced global synchronization during an I/O port > access, that means writel() and outb() can be reordered before they arrive > at a device in theory. Again, this rarely matters in practice, but I think it > makes sense to document the less strict behavior here, given that we have > common hardware that can't provide x86 compatible semantics. Can't you put some kind of super heavy handed barrier in inX/outX ? These things are never going to be performance sensitive anyway... Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 10:01 ` Benjamin Herrenschmidt @ 2018-03-28 10:13 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-28 10:13 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > they should have the same semantics as writel. > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within > > one device, and the memory mapped PCI I/O space window almost certainly > > counts as a separate device to the CPU. > > That sounds bogus. To elaborate, if you do the following on arm: writel(DEVICE_FOO); writel(DEVICE_BAR); we generally cannot guarantee in which order those accesses will hit the devices even if we add every barrier under the sun. You'd need something in between, specific to DEVICE_FOO (probably a read-back) to really push the first write out. This doesn't sound like it would be that uncommon to me. On the other hand: writel(DEVICE_FOO); writel(DEVICE_FOO); is obviously ordered and also things like: writel(DEVICE_FOO_IN_PCI_MEM_SPACE); writel(DEVICE_BAR_IN_SAME_PCI_MEM_SPACE); are ordered up to the PCI host bridge, because that's really the "device" here. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
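The two-device case described above can be illustrated with a toy model. Everything here is a hypothetical sketch, not a description of what any arm interconnect does: `toy_dev` is a register with a one-entry posted-write buffer in front of it, `toy_writel()` is acknowledged as soon as the value is buffered, and `toy_readl()` is a non-posted read that drains earlier writes to the same device. Under those assumptions, a write to FOO followed by a write to BAR gives no guarantee that FOO has seen its value first, whereas a read-back from FOO in between does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy device: a register behind a one-entry posted-write buffer. */
struct toy_dev {
    uint32_t reg;          /* value the endpoint has actually seen */
    uint32_t pending;      /* write acknowledged but still buffered */
    bool     has_pending;
};

/* Push any buffered write through to the endpoint. */
static inline void toy_drain(struct toy_dev *d)
{
    if (d->has_pending) {
        d->reg = d->pending;
        d->has_pending = false;
    }
}

/* Posted write: returns once buffered, not once the endpoint has it.
 * Writes to the SAME device stay in order: the older one lands first. */
void toy_writel(struct toy_dev *d, uint32_t val)
{
    toy_drain(d);
    d->pending = val;
    d->has_pending = true;
}

/* Non-posted read: cannot complete until earlier writes have landed. */
uint32_t toy_readl(struct toy_dev *d)
{
    toy_drain(d);
    return d->reg;
}

/* Write FOO then BAR; report whether FOO had really seen its write
 * at the moment BAR's write was issued. */
bool toy_foo_then_bar(struct toy_dev *foo, struct toy_dev *bar, bool read_back)
{
    assert(foo != bar);            /* the model assumes two distinct devices */
    toy_writel(foo, 1);
    if (read_back)
        (void)toy_readl(foo);      /* device-specific flush of the first write */
    bool foo_landed = (foo->reg == 1);
    toy_writel(bar, 2);
    return foo_landed;
}
```

Calling `toy_foo_then_bar()` with `read_back` false reports that FOO had not yet seen its write when BAR's was issued; with `read_back` true it had. No barrier appears in the model because, as stated above, a CPU barrier alone would not change where the write is buffered.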
* Re: RFC on writel and writel_relaxed 2018-03-28 10:13 ` Will Deacon @ 2018-03-28 16:57 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-28 16:57 UTC (permalink / raw) To: Will Deacon Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > they should have the same semantics as writel. > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within > > > one device, and the memory mapped PCI I/O space window almost certainly > > > counts as a separate device to the CPU. > > > > That sounds bogus. > > To elaborate, if you do the following on arm: > > writel(DEVICE_FOO); > writel(DEVICE_BAR); > > we generally cannot guarantee in which order those accesses will hit the > devices even if we add every barrier under the sun. You'd need something > in between, specific to DEVICE_FOO (probably a read-back) to really push > the first write out. This doesn't sound like it would be that uncommon to > me. The PCI posted write does not require the above to execute 'in order' only that any bus segment shared by the two devices have the writes issued in CPU order. ie at a shared PCI root port for instance. If I recall this is very similar to the ordering that ARM's on-chip AXI interconnect is supposed to provide.. So I'd be very surprised if a modern ARM64 has an meaningful difference from x86 here. When talking about ordering between the devices, the relevant question is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA from the DEVICE_FOO. 
'ordered' means that in this case writel(DEVICE_FOO) must be presented to FOO before anything generated by BAR. Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 16:57 ` Jason Gunthorpe @ 2018-03-29 9:19 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-29 9:19 UTC (permalink / raw) To: Jason Gunthorpe Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote: > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > > they should have the same semantics as writel. > > > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within > > > > one device, and the memory mapped PCI I/O space window almost certainly > > > > counts as a separate device to the CPU. > > > > > > That sounds bogus. > > > > To elaborate, if you do the following on arm: > > > > writel(DEVICE_FOO); > > writel(DEVICE_BAR); > > > > we generally cannot guarantee in which order those accesses will hit the > > devices even if we add every barrier under the sun. You'd need something > > in between, specific to DEVICE_FOO (probably a read-back) to really push > > the first write out. This doesn't sound like it would be that uncommon to > > me. > > The PCI posted write does not require the above to execute 'in order' > only that any bus segment shared by the two devices have the writes > issued in CPU order. ie at a shared PCI root port for instance. > > If I recall this is very similar to the ordering that ARM's on-chip > AXI interconnect is supposed to provide.. So I'd be very surprised if > a modern ARM64 has an meaningful difference from x86 here. 
From the architectural perspective, writes to different "peripherals" are not ordered with respect to each other. The first writel will complete once it gets its write acknowledgement, but this may not necessarily come from the endpoint -- it could come from an intermediate buffer past the point of serialisation (i.e. the write will then be ordered with respect to other accesses to that same endpoint). The PCI root port would look like one peripheral here. > When talking about ordering between the devices, the relevant question > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > from the DEVICE_FOO. 'ordered' means that in this case > writel(DEVICE_FOO) must be presented to FOO before anything generated > by BAR. Yes, and that isn't the case for arm because the writes can still be buffered. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-29 9:19 ` Will Deacon @ 2018-03-29 14:45 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-29 14:45 UTC (permalink / raw) To: Will Deacon Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Thu, Mar 29, 2018 at 10:19:41AM +0100, Will Deacon wrote: > On Wed, Mar 28, 2018 at 10:57:32AM -0600, Jason Gunthorpe wrote: > > On Wed, Mar 28, 2018 at 11:13:45AM +0100, Will Deacon wrote: > > > On Wed, Mar 28, 2018 at 09:01:27PM +1100, Benjamin Herrenschmidt wrote: > > > > On Wed, 2018-03-28 at 11:55 +0200, Arnd Bergmann wrote: > > > > > > powerpc and ARM can't quite make them synchronous I think, but at least > > > > > > they should have the same semantics as writel. > > > > > > > > > > One thing that ARM does IIRC is that it only guarantees to order writel() within > > > > > one device, and the memory mapped PCI I/O space window almost certainly > > > > > counts as a separate device to the CPU. > > > > > > > > That sounds bogus. > > > > > > To elaborate, if you do the following on arm: > > > > > > writel(DEVICE_FOO); > > > writel(DEVICE_BAR); > > > > > > we generally cannot guarantee in which order those accesses will hit the > > > devices even if we add every barrier under the sun. You'd need something > > > in between, specific to DEVICE_FOO (probably a read-back) to really push > > > the first write out. This doesn't sound like it would be that uncommon to > > > me. > > > > The PCI posted write does not require the above to execute 'in order' > > only that any bus segment shared by the two devices have the writes > > issued in CPU order. ie at a shared PCI root port for instance. > > > > If I recall this is very similar to the ordering that ARM's on-chip > > AXI interconnect is supposed to provide.. 
So I'd be very surprised if > > a modern ARM64 has an meaningful difference from x86 here. > > From the architectural perspective, writes to different "peripherals" are > not ordered with respect to each other. The first writel will complete once > it gets its write acknowledgement, but this may not necessarily come from > the endpoint -- it could come from an intermediate buffer past the point of > serialisation (i.e. the write will then be ordered with respect to other > accesses to that same endpoint). The PCI root port would look like one > peripheral here. That is basically the same as PCI - PCI has no write ACK, so all writes are buffered by the PCI interconnect and complete in some undefined temporal order when multiple end points are involved. This does not seem very different from what happens in x86.. > > When talking about ordering between the devices, the relevant question > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > from the DEVICE_FOO. 'ordered' means that in this case > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > by BAR. > > Yes, and that isn't the case for arm because the writes can still be > buffered. The statement is not about buffering, or temporal completion order, or the order of acks returning to the CPU. It is about pure transaction ordering inside the interconnect. Can write BAR -> FOO pass write CPU -> FOO? Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* RE: RFC on writel and writel_relaxed 2018-03-29 14:45 ` Jason Gunthorpe @ 2018-03-29 14:58 ` David Laight -1 siblings, 0 replies; 216+ messages in thread From: David Laight @ 2018-03-29 14:58 UTC (permalink / raw) To: 'Jason Gunthorpe', Will Deacon Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Sinan Kaya, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) From: Jason Gunthorpe > Sent: 29 March 2018 15:45 ... > > > When talking about ordering between the devices, the relevant question > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > > from the DEVICE_FOO. 'ordered' means that in this case > > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > > by BAR. > > > > Yes, and that isn't the case for arm because the writes can still be > > buffered. > > The statement is not about buffering, or temporal completion order, or > the order of acks returning to the CPU. It is about pure transaction > ordering inside the interconnect. > > Can write BAR -> FOO pass write CPU -> FOO? Almost certainly. The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge. The second cpu write then completes (to a different target). That target then issues a peer to peer transfer that reaches the shared bridge. I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'. Of course, these are peer to peer transfers, and strange ones at that. Normally you'd not be doing peer to peer transfers that access 'memory' the cpu has just written to. Requiring extra barriers in this case, or different functions for WC accesses shouldn't really be an issue. Even requiring a barrier between a write to dma coherent memory and a write that starts dma isn't really onerous. Even if it is a nop on all current architectures it is a good comment in the code. It could even have a 'dev' argument. David ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-29 14:58 ` David Laight @ 2018-03-29 16:40 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-29 16:40 UTC (permalink / raw) To: David Laight Cc: Arnd Bergmann, Jonathan Corbet, linux-rdma, Will Deacon, Sinan Kaya, Peter Zijlstra, Ingo Molnar, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Thu, Mar 29, 2018 at 02:58:34PM +0000, David Laight wrote: > From: Jason Gunthorpe > > Sent: 29 March 2018 15:45 > ... > > > > When talking about ordering between the devices, the relevant question > > > > is what happens if the writel(DEVICE_BAR) triggers DEVICE_BAR to DMA > > > > from the DEVICE_FOO. 'ordered' means that in this case > > > > writel(DEVICE_FOO) must be presented to FOO before anything generated > > > > by BAR. > > > > > > Yes, and that isn't the case for arm because the writes can still be > > > buffered. > > > > The statement is not about buffering, or temporal completion order, or > > the order of acks returning to the CPU. It is about pure transaction > > ordering inside the interconnect. > > > > Can write BAR -> FOO pass write CPU -> FOO? > > Almost certainly. > The first cpu write can almost certainly be 'stalled' at the shared PCIe bridge. > The second cpu write then completes (to a different target). > That target then issues a peer to peer transfer that reaches the shared bridge. > I doubt the order of the transactions is guaranteed when it becomes 'un-stalled'. The PCI spec has very strong wording on ordering that covers this case. Stalled bridges have to follow the ordering rules, and posted writes cannot pass other posted writes. Since in PCI all three transactions: CPU -> FOO CPU -> BAR BAR -> FOO Must traverse a shared bus segment, they must be placed on that bus in the above order, and the bridge(s) toward FOO must preserve this order. 
ARM's AXI has similar rules, I just can't recall the tiny details right now :) > Of course, these are peer to peer transfers, and strange ones at that. > Normally you'd not be doing peer to peer transfers that access 'memory' > the cpu has just written to. It is the best situation I can think of where order of completion to different devices would matter to a generic Linux driver.. .. And there are patches circulating right now for NVMe that enable exactly this kind of transfer, and rely on these kind of semantics, so it is a relevant detail :) Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
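Editor's sketch of the ordering rule Jason describes above, as a toy user-space model in C. It assumes the shared bus segment behaves as a strict FIFO, which is the simplest structure in which posted writes never pass each other. The names `bus_segment`, `bus_post` and `bus_drain_to` are hypothetical, and the model says nothing about how an ARM interconnect actually behaves.

```c
#include <assert.h>
#include <stddef.h>

/* Initiators and endpoints, named after the example in the thread. */
enum node { CPU, FOO, BAR };

struct posted_write {
    enum node src;  /* who issued the write */
    enum node dst;  /* which endpoint it targets */
};

#define BUS_DEPTH 8

/*
 * Hypothetical shared bus segment: writes may sit here for a while
 * (a "stalled" bridge), but they leave in the order they were placed.
 */
struct bus_segment {
    struct posted_write q[BUS_DEPTH];
    size_t head, tail;
};

/* Place a posted write on the shared segment. */
void bus_post(struct bus_segment *bus, enum node src, enum node dst)
{
    assert(bus->tail < BUS_DEPTH);  /* toy queue, no wrap-around */
    bus->q[bus->tail].src = src;
    bus->q[bus->tail].dst = dst;
    bus->tail++;
}

/*
 * Drain the segment in FIFO order and record, in arrival order, the
 * sources of the writes delivered to 'dst'. Returns how many were seen.
 */
size_t bus_drain_to(struct bus_segment *bus, enum node dst,
                    enum node *seen, size_t max)
{
    size_t n = 0;

    while (bus->head < bus->tail) {
        struct posted_write w = bus->q[bus->head++];

        if (w.dst == dst && n < max)
            seen[n++] = w.src;
    }
    return n;
}
```

Posting CPU -> FOO, CPU -> BAR, BAR -> FOO in that order and then draining towards FOO delivers the CPU's write first, which is the property the question "Can write BAR -> FOO pass write CPU -> FOO?" is asking about.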
* Re: RFC on writel and writel_relaxed 2018-03-27 13:46 ` Sinan Kaya @ 2018-03-27 21:24 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 21:24 UTC (permalink / raw) To: Sinan Kaya, Will Deacon, Arnd Bergmann Cc: Jonathan Corbet, linux-rdma, Jason Gunthorpe, Peter Zijlstra, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), Ingo Molnar On Tue, 2018-03-27 at 09:46 -0400, Sinan Kaya wrote: > On 3/27/2018 7:02 AM, Will Deacon wrote: > > - See Documentation/DMA-API.txt for more information on consistent memory. > > + can see it now has ownership. Note that, when using writel(), a prior > > + wmb() is not needed to guarantee that the cache coherent memory writes > > + have completed before writing to the MMIO region. The cheaper > > + writel_relaxed() does not provide this guarantee and must not be used > > + here. > > Can we say the same thing for iowrite32() and iowrite32be(). I also see wmb() > in front of these. Yes, they should have the same semantics as writel Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
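Editor's sketch of the pattern the quoted documentation text covers: fill a descriptor in coherent memory, then ring a doorbell. This is a user-space model in C11, not kernel code. `model_writel()`, `model_writel_relaxed()` and `model_wmb()` are hypothetical stand-ins that use C11 fences purely as an analogy for the architecture-specific barriers, and `demo_desc` is an invented descriptor layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Invented descriptor living in cache-coherent DMA memory. */
struct demo_desc {
    uint64_t addr;
    uint32_t len;
};

/* Stand-in for wmb(): order earlier writes before later ones. */
void model_wmb(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Stand-in for writel(): implied ordering, then the register store. */
void model_writel(uint32_t val, _Atomic uint32_t *reg)
{
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(reg, val, memory_order_relaxed);
}

/* Stand-in for writel_relaxed(): the register store only. */
void model_writel_relaxed(uint32_t val, _Atomic uint32_t *reg)
{
    atomic_store_explicit(reg, val, memory_order_relaxed);
}

/* With writel(): no explicit barrier before the doorbell. */
void demo_submit(struct demo_desc *d, _Atomic uint32_t *doorbell,
                 uint64_t addr, uint32_t len)
{
    assert(d && doorbell);
    d->addr = addr;
    d->len = len;
    model_writel(1, doorbell);
}

/* With writel_relaxed(): the caller must supply the barrier itself. */
void demo_submit_relaxed(struct demo_desc *d, _Atomic uint32_t *doorbell,
                         uint64_t addr, uint32_t len)
{
    assert(d && doorbell);
    d->addr = addr;
    d->len = len;
    model_wmb();
    model_writel_relaxed(1, doorbell);
}
```

Under the semantics proposed in the patch, a wmb() placed in front of writel() in the first form would be redundant. Per Ben's reply, iowrite32() and iowrite32be() should behave the same way.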
* Re: RFC on writel and writel_relaxed 2018-03-27 9:57 ` Will Deacon @ 2018-03-27 11:21 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 11:21 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Arnd Bergmann, corbet, linux-rdma, Sinan Kaya, Jason Gunthorpe, peterz, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), mingo On Tue, 2018-03-27 at 10:57 +0100, Will Deacon wrote: > > > > The interesting thing is that we do seem to have a whole LOT of these > > spurrious wmb before writel all over the tree, I suspect because of > > that incorrect recommendation in memory-barriers.txt. > > > > We should fix that. > > Patch below. Thoughts? Looks good, we should probably also have a clearer (explicit) definition of that ordering in the driver-api doco. Now, to remove all those useless wmb's and find what other bugs they were papering over ... ;-) Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 7:56 ` Arnd Bergmann @ 2018-03-27 9:42 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 9:42 UTC (permalink / raw) To: Arnd Bergmann Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. I'm pretty sure I've never seen > any bug reports pointing to a missing wmb() between memory > and MMIO write accesses, but if you remember seeing them in the > list, maybe you can look again for some evidence of something going > wrong on x86 without it? If this is just about ordering accesses to coherent DMA, then using dma_wmb() instead will be much better performance on arm/arm64. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
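Editor's sketch of the case Will mentions: two writes to the same coherent DMA area, where only their order relative to each other matters and no MMIO write is involved. This is again a user-space C11 model. `model_dma_wmb()` is a hypothetical stand-in for dma_wmb(), and the descriptor layout and `DEMO_DESC_OWN` bit are invented for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_DESC_OWN 0x80000000u  /* invented "device owns this" bit */

/* Invented ring descriptor in coherent DMA memory. */
struct demo_ring_desc {
    uint64_t addr;
    uint32_t len;
    _Atomic uint32_t status;
};

/*
 * Stand-in for dma_wmb(): orders earlier writes to coherent memory
 * before later ones. A C11 release fence is used only as an analogy.
 */
void model_dma_wmb(void)
{
    atomic_thread_fence(memory_order_release);
}

/* Fill in the payload first, then hand the descriptor to the device. */
void demo_publish(struct demo_ring_desc *d, uint64_t addr, uint32_t len)
{
    assert(d);
    d->addr = addr;
    d->len = len;

    /* The device must not see the OWN bit before addr and len. */
    model_dma_wmb();

    atomic_store_explicit(&d->status, DEMO_DESC_OWN, memory_order_relaxed);
}
```

The design point is that the lighter barrier is enough when both writes target coherent memory. A full wmb() would also work here, but according to Will it costs noticeably more on arm/arm64.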
* Re: RFC on writel and writel_relaxed 2018-03-27 9:42 ` Will Deacon @ 2018-03-27 11:20 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 11:20 UTC (permalink / raw) To: Will Deacon, Arnd Bergmann Cc: Paul E. McKenney, linux-rdma, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote: > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > If this is just about ordering accesses to coherent DMA, then using > dma_wmb() instead will be much better performance on arm/arm64. Ah, something we should look into for powerpc as well, as we could use an lwsync for that which is also cheaper than a full sync wmb does. dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP conditional right ? Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 11:20 ` Benjamin Herrenschmidt @ 2018-03-27 11:24 ` Will Deacon -1 siblings, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 11:24 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 10:20:02PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 10:42 +0100, Will Deacon wrote: > > > > > > This example adds a wmb() between two writes to a coherent DMA > > > area, it is definitely required there. I'm pretty sure I've never seen > > > any bug reports pointing to a missing wmb() between memory > > > and MMIO write accesses, but if you remember seeing them in the > > > list, maybe you can look again for some evidence of something going > > > wrong on x86 without it? > > > > If this is just about ordering accesses to coherent DMA, then using > > dma_wmb() instead will be much better performance on arm/arm64. > > Ah, something we should look into for powerpc as well, as we could use > an lwsync for that which is also cheaper than a full sync wmb does. > > dma_wmb() is basically the same as smp_wmb() without the CONFIG_SMP > conditional right ? Almost -- the slight change we have on arm64 is to say that it's "outer-shareable", which means it also orders non-cacheable accesses in the case that dma_alloc_coherent is used to allocate a consistent buffer for a non-coherent device. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 9:42 ` Will Deacon @ 2018-03-27 14:24 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-27 14:24 UTC (permalink / raw) To: Will Deacon Cc: Paul E. McKenney, Arnd Bergmann, linux-rdma, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 10:42:00AM +0100, Will Deacon wrote: > On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > > > I even see patches adding wmb() based on actual observed memory > > > corruption during testing on Intel: > > > > > > https://patchwork.kernel.org/patch/10177207/ > > > > > > So you think all of this is unnecessary and writel is totally strongly > > > ordered, even on multi-socket Intel? > > > > This example adds a wmb() between two writes to a coherent DMA > > area, it is definitely required there. I'm pretty sure I've never seen > > any bug reports pointing to a missing wmb() between memory > > and MMIO write accesses, but if you remember seeing them in the > > list, maybe you can look again for some evidence of something going > > wrong on x86 without it? > > If this is just about ordering accesses to coherent DMA, then using > dma_wmb() instead will be much better performance on arm/arm64. dma_wmb() is a NOP on x86, it was tested anyhow and didn't help this case.. Confusing, but probably not relevant to this discussion. Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 7:56 ` Arnd Bergmann @ 2018-03-27 14:16 ` Jason Gunthorpe -1 siblings, 0 replies; 216+ messages in thread From: Jason Gunthorpe @ 2018-03-27 14:16 UTC (permalink / raw) To: Arnd Bergmann Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018 at 09:56:47AM +0200, Arnd Bergmann wrote: > On Tue, Mar 27, 2018 at 12:27 AM, Jason Gunthorpe <jgg@ziepe.ca> wrote: > > On Tue, Mar 27, 2018 at 09:01:57AM +1100, Benjamin Herrenschmidt wrote: > >> On Mon, 2018-03-26 at 17:46 -0400, Sinan Kaya wrote: > > > > I even see patches adding wmb() based on actual observed memory > > corruption during testing on Intel: > > > > https://patchwork.kernel.org/patch/10177207/ > > > > So you think all of this is unnecessary and writel is totally strongly > > ordered, even on multi-socket Intel? > > This example adds a wmb() between two writes to a coherent DMA > area, it is definitely required there. Ah! So it is, too many things called 'db' in that driver.. One of the 'db' is also WC BAR memory.. Jason ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 21:30 ` Arnd Bergmann @ 2018-03-26 22:00 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 22:00 UTC (permalink / raw) To: Arnd Bergmann, Jason Gunthorpe Cc: Paul E. McKenney, linux-rdma, Will Deacon, Sinan Kaya, David Laight, Oliver, Alexander Duyck, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: > Most of the drivers have a unwound loop with writeq() or something to > > do it. > > But isn't the writeq() barrier much more expensive than anything you'd > do in function calls? It is for us, and will break any write combining. > > > > The same document says that _relaxed() does not give that guarentee. > > > > > > > > The lwn articule on this went into some depth on the interaction with > > > > spinlocks. > > > > > > > > As far as I can see, containment in a spinlock seems to be the only > > > > different between writel and writel_relaxed.. > > > > > > I was always puzzled by this: The intention of _relaxed() on ARM > > > (where it originates) was to skip the barrier that serializes DMA > > > with MMIO, not to skip the serialization between MMIO and locks. > > > > But that was never a requirement of writel(), > > Documentation/memory-barriers.txt gives an explicit example demanding > > the wmb() before writel() for ordering system memory against writel. This is a bug in the documentation. > Indeed, but it's in an example for when to use dma_wmb(), not wmb(). > Adding Alexander Duyck to Cc, he added that section as part of > 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > dma_wmb()"). Also adding the other people that were involved with that. Linus himself made it very clear years ago. readl and writel have to order vs memory accesses. 
> > I actually have no idea why ARM had that barrier, I always assumed it > > was to give program ordering to the accesses and that _relaxed allowed > > re-ordering (the usual meaning of relaxed).. > > > > But the barrier document makes it pretty clear that the only > > difference between the two is spinlock containment, and WillD wrote > > this text, so I believe it is accurate for ARM. > > > > Very confusing. > > It does mention serialization with both DMA and locks in the > section about readX_relaxed()/writeX_relaxed(). The part > about DMA is very clear here, and I must have just forgotten > the exact semantics with regards to spinlocks. I'm still not > sure what prevents a writel() from leaking out the end of a > spinlock section that doesn't happen with writel_relaxed(), since > the barrier in writel() comes before the access, and the > spin_unlock() shouldn't affect the external buses.

So...

Historically, what happened is that we (we means whoever participated in the discussion on the list, with Linus calling the shots really) decided that there was no sane way for drivers to understand a world where readl/writel didn't fully order things vs. memory accesses (i.e., DMA).

So it should always be correct to do:

- Write to some in-memory buffer
- writel() to kick the DMA read of that buffer

without any extra barrier.

The spinlock situation however got murky. Mostly that came up because one architecture (I forgot which, might have been ia64) had a hard time providing that consistency without making writel insanely expensive.

Thus they created mmiowb, whose main purpose was precisely to order writel with a following spin_unlock.

I decided not to go down that path on power because getting all drivers "fixed" to do the right thing was going to be a losing battle, and instead added per-cpu tracking of writel in order to "escalate" to a heavier barrier in spin_unlock itself when necessary.
Now, all this happened more than a decade ago and it's possible that the understanding or expectations "shifted" over time... Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
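The per-cpu tracking Ben describes for power can be modelled roughly as follows. This is a userspace sketch, not the powerpc implementation: `struct model_cpu`, its `io_sync` flag, the `heavy_syncs` counter and all `model_*` functions are hypothetical names, and C11 fences stand in for the real barrier instructions. The idea it illustrates is that the MMIO write only records that it happened, and the unlock path pays for the heavier barrier only when that record is set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU state; field names are illustrative only. */
struct model_cpu {
	bool io_sync;             /* an MMIO write was issued since the last sync */
	unsigned int heavy_syncs; /* counts escalations, for demonstration only */
};

struct model_lock {
	atomic_flag held;
};

static void model_spin_lock(struct model_lock *l)
{
	assert(l != NULL);
	while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
		; /* spin until the flag was previously clear */
}

/* Model of writel(): order prior memory writes, store, remember the MMIO. */
static void model_writel(struct model_cpu *cpu, volatile uint32_t *reg,
			 uint32_t val)
{
	atomic_thread_fence(memory_order_seq_cst);
	*reg = val;
	cpu->io_sync = true;
}

/* Model of spin_unlock(): escalate only if an MMIO write is pending. */
static void model_spin_unlock(struct model_cpu *cpu, struct model_lock *l)
{
	if (cpu->io_sync) {
		atomic_thread_fence(memory_order_seq_cst); /* the "heavier" barrier */
		cpu->io_sync = false;
		cpu->heavy_syncs++;
	}
	atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

Critical sections that never touch MMIO keep the cheap unlock path, which is the trade-off being weighed against asking every driver to call `mmiowb()` by hand.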
* Re: RFC on writel and writel_relaxed 2018-03-26 22:00 ` Benjamin Herrenschmidt (?) @ 2018-03-27 14:46 ` Sinan Kaya 2018-03-27 15:01 ` Jose Abreu ` (2 more replies) -1 siblings, 3 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-27 14:46 UTC (permalink / raw) To: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe Cc: David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney, netdev, Alexander Duyck +netdev, +Alex On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: >> Most of the drivers have a unwound loop with writeq() or something to >>> do it. >> >> But isn't the writeq() barrier much more expensive than anything you'd >> do in function calls? > > It is for us, and will break any write combining. > >>>>> The same document says that _relaxed() does not give that guarentee. >>>>> >>>>> The lwn articule on this went into some depth on the interaction with >>>>> spinlocks. >>>>> >>>>> As far as I can see, containment in a spinlock seems to be the only >>>>> different between writel and writel_relaxed.. >>>> >>>> I was always puzzled by this: The intention of _relaxed() on ARM >>>> (where it originates) was to skip the barrier that serializes DMA >>>> with MMIO, not to skip the serialization between MMIO and locks. >>> >>> But that was never a requirement of writel(), >>> Documentation/memory-barriers.txt gives an explicit example demanding >>> the wmb() before writel() for ordering system memory against writel. > > This is a bug in the documentation. > >> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). >> Adding Alexander Duyck to Cc, he added that section as part of >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and >> dma_wmb()"). Also adding the other people that were involved with that. > > Linus himself made it very clear years ago. 
readl and writel have to > order vs memory accesses. > >>> I actually have no idea why ARM had that barrier, I always assumed it >>> was to give program ordering to the accesses and that _relaxed allowed >>> re-ordering (the usual meaning of relaxed).. >>> >>> But the barrier document makes it pretty clear that the only >>> difference between the two is spinlock containment, and WillD wrote >>> this text, so I belive it is accurate for ARM. >>> >>> Very confusing. >> >> It does mention serialization with both DMA and locks in the >> section about readX_relaxed()/writeX_relaxed(). The part >> about DMA is very clear here, and I must have just forgotten >> the exact semantics with regards to spinlocks. I'm still not >> sure what prevents a writel() from leaking out the end of a >> spinlock section that doesn't happen with writel_relaxed(), since >> the barrier in writel() comes before the access, and the >> spin_unlock() shouldn't affect the external buses. > > So... > > Historically, what happened is that we (we means whoever participated > in the discussion on the list with Linus calling the shots really) > decided that there was no sane way for drivers to understand a world > where readl/writel didn't fully order things vs. memory accesses (ie, > DMA). > > So it should always be correct to do: > > - Write to some in-memory buffer > - writel() to kick the DMA read of that buffer > > without any extra barrier. > > The spinlock situation however got murky. Mostly that came up because > on architecture (I forgot who, might have been ia64) has a hard time > providing that consistency without making writel insanely expensive. > > Thus they created mmiowb whose main purpose was precisely to order > writel with a following spin_unlock. 
> > I decided not to go down that path on power because getting all drivers > "fixed" to do the right thing was going to be a losing battle, and > instead added per-cpu tracking of writel in order to "escalate" to a > heavier barrier in spin_unlock itself when necessary. > > Now, all this happened more than a decade ago and it's possible that > the understanding or expectations "shifted" over time...

Alex is raising concerns on the netdev list.

Sinan: "We are being told that if you use writel(), then you don't need a wmb() on all architectures."

Alex: "I'm not sure who told you that but that is incorrect, at least for x86. If you attempt to use writel() without the wmb() we will have to NAK the patches. We will accept the wmb() with writel_relaxed() since that solves things for ARM."

> Jason is seeking behavior clarification for write combined buffers.

Alex: "Don't bother. I can tell you right now that for x86 you have to have a wmb() before the writel(). Based on the comment in (https://www.spinics.net/lists/linux-rdma/msg62666.html): Replacing wmb() + writel() with wmb() + writel_relaxed() will work on PPC, it will just not give you a benefit today. I say the patch set stays. This gives benefit on ARM, and has no effect on x86 and PowerPC. If you want to look at trying to optimize things further on PowerPC and such then go for it in terms of trying to implement the writel_relaxed(). Otherwise I say we call the ARM goodness a win and don't get ourselves too wrapped up in trying to fix this for all architectures."

> > Cheers, > Ben. > > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 14:46 ` Sinan Kaya @ 2018-03-27 15:01 ` Jose Abreu 2018-03-27 15:10 ` Will Deacon 2018-03-27 21:35 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 216+ messages in thread From: Jose Abreu @ 2018-03-27 15:01 UTC (permalink / raw) To: Sinan Kaya, Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe Cc: David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney, netdev, Alexander Duyck Hi, On 27-03-2018 15:46, Sinan Kaya wrote: > > Sinan > "We are being told that if you use writel(), then you don't need a wmb() on > all architectures." > > Alex: > "I'm not sure who told you that but that is incorrect, at least for > x86. If you attempt to use writel() without the wmb() we will have to > NAK the patches. We will accept the wmb() with writel_relaxed() since > that solves things for ARM." > So this means we should always use writel() + wmb() in *all* accesses? I don't know about x86, but the ARC architecture doesn't have a wmb() in the writel() function (in some configs). I see the point in net drivers, where you have DMA + I/O accesses, but for most drivers this shouldn't be needed, right? What about ordering of writes? Is it guaranteed that one write will happen before the next one? Best Regards, Jose Miguel Abreu ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 14:46 ` Sinan Kaya 2018-03-27 15:01 ` Jose Abreu @ 2018-03-27 15:10 ` Will Deacon 2018-03-27 18:54 ` Alexander Duyck 2018-03-28 1:21 ` Benjamin Herrenschmidt 2018-03-27 21:35 ` Benjamin Herrenschmidt 2 siblings, 2 replies; 216+ messages in thread From: Will Deacon @ 2018-03-27 15:10 UTC (permalink / raw) To: Sinan Kaya Cc: Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Alexander Duyck, torvalds Hi Alex, On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote: > +netdev, +Alex > > On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: > > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: > >> Most of the drivers have a unwound loop with writeq() or something to > >>> do it. > >> > >> But isn't the writeq() barrier much more expensive than anything you'd > >> do in function calls? > > > > It is for us, and will break any write combining. > > > >>>>> The same document says that _relaxed() does not give that guarentee. > >>>>> > >>>>> The lwn articule on this went into some depth on the interaction with > >>>>> spinlocks. > >>>>> > >>>>> As far as I can see, containment in a spinlock seems to be the only > >>>>> different between writel and writel_relaxed.. > >>>> > >>>> I was always puzzled by this: The intention of _relaxed() on ARM > >>>> (where it originates) was to skip the barrier that serializes DMA > >>>> with MMIO, not to skip the serialization between MMIO and locks. > >>> > >>> But that was never a requirement of writel(), > >>> Documentation/memory-barriers.txt gives an explicit example demanding > >>> the wmb() before writel() for ordering system memory against writel. > > > > This is a bug in the documentation. > > > >> Indeed, but it's in an example for when to use dma_wmb(), not wmb(). 
> >> Adding Alexander Duyck to Cc, he added that section as part of > >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and > >> dma_wmb()"). Also adding the other people that were involved with that. > > > > Linus himself made it very clear years ago. readl and writel have to > > order vs memory accesses. > > > >>> I actually have no idea why ARM had that barrier, I always assumed it > >>> was to give program ordering to the accesses and that _relaxed allowed > >>> re-ordering (the usual meaning of relaxed).. > >>> > >>> But the barrier document makes it pretty clear that the only > >>> difference between the two is spinlock containment, and WillD wrote > >>> this text, so I belive it is accurate for ARM. > >>> > >>> Very confusing. > >> > >> It does mention serialization with both DMA and locks in the > >> section about readX_relaxed()/writeX_relaxed(). The part > >> about DMA is very clear here, and I must have just forgotten > >> the exact semantics with regards to spinlocks. I'm still not > >> sure what prevents a writel() from leaking out the end of a > >> spinlock section that doesn't happen with writel_relaxed(), since > >> the barrier in writel() comes before the access, and the > >> spin_unlock() shouldn't affect the external buses. > > > > So... > > > > Historically, what happened is that we (we means whoever participated > > in the discussion on the list with Linus calling the shots really) > > decided that there was no sane way for drivers to understand a world > > where readl/writel didn't fully order things vs. memory accesses (ie, > > DMA). > > > > So it should always be correct to do: > > > > - Write to some in-memory buffer > > - writel() to kick the DMA read of that buffer > > > > without any extra barrier. > > > > The spinlock situation however got murky. Mostly that came up because > > on architecture (I forgot who, might have been ia64) has a hard time > > providing that consistency without making writel insanely expensive. 
> > > > Thus they created mmiowb whose main purpose was precisely to order > > writel with a following spin_unlock. > > > > I decided not to go down that path on power because getting all drivers > > "fixed" to do the right thing was going to be a losing battle, and > > instead added per-cpu tracking of writel in order to "escalate" to a > > heavier barrier in spin_unlock itself when necessary. > > > > Now, all this happened more than a decade ago and it's possible that > > the understanding or expectations "shifted" over time... > > Alex is raising concerns on the netdev list. > > Sinan > "We are being told that if you use writel(), then you don't need a wmb() on > all architectures." > > Alex: > "I'm not sure who told you that but that is incorrect, at least for > x86. If you attempt to use writel() without the wmb() we will have to > NAK the patches. We will accept the wmb() with writel_releaxed() since > that solves things for ARM." > > > Jason is seeking behavior clarification for write combined buffers. > > Alex: > "Don't bother. I can tell you right now that for x86 you have to have a > wmb() before the writel(). To clarify: are you saying that on x86 you need a wmb() prior to a writel if you want that writel to be ordered after prior writes to memory? Is this specific to WC memory or some other non-standard attribute? The only reason we have wmb() inside writel() on arm, arm64 and power is for parity with x86 because Linus (CC'd) wanted architectures to order I/O vs memory by default so that it was easier to write portable drivers. The performance impact of that implicit barrier is non-trivial, but we want the driver portability and I went as far as adding generic _relaxed versions for the cases where ordering isn't required. 
You seem to be suggesting that none of this is necessary and drivers would already run into problems on x86 if they didn't use wmb() explicitly in conjunction with writel, which I find hard to believe and is in direct contradiction with the current Linux I/O memory model (modulo the broken example in the dma_*mb section of memory-barriers.txt). Has something changed? Will ^ permalink raw reply [flat|nested] 216+ messages in thread
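The relationship Will describes, where `writel()` carries an implicit barrier and the `_relaxed` variant leaves ordering to the caller, can be sketched in userspace as below. Every `model_*` name is hypothetical, a plain `volatile` variable plays the device register, and a C11 fence stands in for `wmb()`; none of this says what a given architecture actually emits. The two `model_kick_*` functions correspond to the two driver styles being argued over: plain `writel()`, and explicit `wmb()` followed by `writel_relaxed()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for wmb(); a real one must also order against MMIO. */
static inline void model_wmb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
}

/* Bare register store: no ordering against earlier memory writes. */
static inline void model_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
}

/* Ordered register store: barrier first, then the relaxed store. */
static inline void model_writel(uint32_t val, volatile uint32_t *reg)
{
	model_wmb();
	model_writel_relaxed(val, reg);
}

/* Style 1: rely on the barrier implied by writel(). */
static void model_kick_implicit(uint32_t *buf, uint32_t data,
				volatile uint32_t *doorbell)
{
	assert(buf != NULL && doorbell != NULL);
	buf[0] = data;             /* write to the in-memory buffer */
	model_writel(1, doorbell); /* kick the device */
}

/* Style 2: spell the barrier out and use the relaxed accessor. */
static void model_kick_explicit(uint32_t *buf, uint32_t data,
				volatile uint32_t *doorbell)
{
	assert(buf != NULL && doorbell != NULL);
	buf[0] = data;
	model_wmb();
	model_writel_relaxed(1, doorbell);
}
```

In this model the two styles issue the same single barrier. Writing `model_wmb()` followed by `model_writel()` would issue two, which is the redundancy the `wmb()` + `writel_relaxed()` conversion is meant to remove on architectures where `writel()` already contains a barrier.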
* Re: RFC on writel and writel_relaxed 2018-03-27 15:10 ` Will Deacon @ 2018-03-27 18:54 ` Alexander Duyck 2018-03-27 19:54 ` Arnd Bergmann 2018-03-27 21:33 ` Benjamin Herrenschmidt 2018-03-28 1:21 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 216+ messages in thread From: Alexander Duyck @ 2018-03-27 18:54 UTC (permalink / raw) To: Will Deacon Cc: Sinan Kaya, Benjamin Herrenschmidt, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Linus Torvalds On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote: > Hi Alex, > > On Tue, Mar 27, 2018 at 10:46:58AM -0400, Sinan Kaya wrote: >> +netdev, +Alex >> >> On 3/26/2018 6:00 PM, Benjamin Herrenschmidt wrote: >> > On Mon, 2018-03-26 at 23:30 +0200, Arnd Bergmann wrote: >> >> Most of the drivers have a unwound loop with writeq() or something to >> >>> do it. >> >> >> >> But isn't the writeq() barrier much more expensive than anything you'd >> >> do in function calls? >> > >> > It is for us, and will break any write combining. >> > >> >>>>> The same document says that _relaxed() does not give that guarentee. >> >>>>> >> >>>>> The lwn articule on this went into some depth on the interaction with >> >>>>> spinlocks. >> >>>>> >> >>>>> As far as I can see, containment in a spinlock seems to be the only >> >>>>> different between writel and writel_relaxed.. >> >>>> >> >>>> I was always puzzled by this: The intention of _relaxed() on ARM >> >>>> (where it originates) was to skip the barrier that serializes DMA >> >>>> with MMIO, not to skip the serialization between MMIO and locks. >> >>> >> >>> But that was never a requirement of writel(), >> >>> Documentation/memory-barriers.txt gives an explicit example demanding >> >>> the wmb() before writel() for ordering system memory against writel. >> > >> > This is a bug in the documentation. 
>> >
>> >> Indeed, but it's in an example for when to use dma_wmb(), not wmb().
>> >> Adding Alexander Duyck to Cc, he added that section as part of
>> >> 1077fa36f23e ("arch: Add lightweight memory barriers dma_rmb() and
>> >> dma_wmb()"). Also adding the other people that were involved with that.
>> >
>> > Linus himself made it very clear years ago. readl and writel have to
>> > order vs memory accesses.
>> >
>> >>> I actually have no idea why ARM had that barrier, I always assumed it
>> >>> was to give program ordering to the accesses and that _relaxed allowed
>> >>> re-ordering (the usual meaning of relaxed)..
>> >>>
>> >>> But the barrier document makes it pretty clear that the only
>> >>> difference between the two is spinlock containment, and WillD wrote
>> >>> this text, so I believe it is accurate for ARM.
>> >>>
>> >>> Very confusing.
>> >>
>> >> It does mention serialization with both DMA and locks in the
>> >> section about readX_relaxed()/writeX_relaxed(). The part
>> >> about DMA is very clear here, and I must have just forgotten
>> >> the exact semantics with regards to spinlocks. I'm still not
>> >> sure what prevents a writel() from leaking out the end of a
>> >> spinlock section that doesn't happen with writel_relaxed(), since
>> >> the barrier in writel() comes before the access, and the
>> >> spin_unlock() shouldn't affect the external buses.
>> >
>> > So...
>> >
>> > Historically, what happened is that we (we means whoever participated
>> > in the discussion on the list with Linus calling the shots really)
>> > decided that there was no sane way for drivers to understand a world
>> > where readl/writel didn't fully order things vs. memory accesses (ie,
>> > DMA).
>> >
>> > So it should always be correct to do:
>> >
>> > - Write to some in-memory buffer
>> > - writel() to kick the DMA read of that buffer
>> >
>> > without any extra barrier.
>> >
>> > The spinlock situation however got murky. Mostly that came up because
>> > one architecture (I forgot who, might have been ia64) has a hard time
>> > providing that consistency without making writel insanely expensive.
>> >
>> > Thus they created mmiowb whose main purpose was precisely to order
>> > writel with a following spin_unlock.
>> >
>> > I decided not to go down that path on power because getting all drivers
>> > "fixed" to do the right thing was going to be a losing battle, and
>> > instead added per-cpu tracking of writel in order to "escalate" to a
>> > heavier barrier in spin_unlock itself when necessary.
>> >
>> > Now, all this happened more than a decade ago and it's possible that
>> > the understanding or expectations "shifted" over time...
>>
>> Alex is raising concerns on the netdev list.
>>
>> Sinan
>> "We are being told that if you use writel(), then you don't need a wmb() on
>> all architectures."
>>
>> Alex:
>> "I'm not sure who told you that but that is incorrect, at least for
>> x86. If you attempt to use writel() without the wmb() we will have to
>> NAK the patches. We will accept the wmb() with writel_relaxed() since
>> that solves things for ARM."
>>
>> > Jason is seeking behavior clarification for write combined buffers.
>>
>> Alex:
>> "Don't bother. I can tell you right now that for x86 you have to have a
>> wmb() before the writel().
>
> To clarify: are you saying that on x86 you need a wmb() prior to a writel
> if you want that writel to be ordered after prior writes to memory? Is this
> specific to WC memory or some other non-standard attribute?

Note, I am not a CPU guy so this is just my interpretation. It is my
understanding that the wmb(), aka sfence, is needed on x86 to sort out
writes between Write-back (WB) system memory and Strong Uncacheable
(UC) MMIO accesses.

I was hoping to be able to cite something in the software developers
manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
but that tends to be pretty vague. I have re-read section 22.34
(volume 3B) several times and I am still not clear on whether it says we
need the sfence or not. It is a matter of figuring out what the impact
of store buffers and caching are for WB versus UC memory.

> The only reason we have wmb() inside writel() on arm, arm64 and power is for
> parity with x86 because Linus (CC'd) wanted architectures to order I/O vs
> memory by default so that it was easier to write portable drivers. The
> performance impact of that implicit barrier is non-trivial, but we want the
> driver portability and I went as far as adding generic _relaxed versions for
> the cases where ordering isn't required. You seem to be suggesting that none
> of this is necessary and drivers would already run into problems on x86 if
> they didn't use wmb() explicitly in conjunction with writel, which I find
> hard to believe and is in direct contradiction with the current Linux I/O
> memory model (modulo the broken example in the dma_*mb section of
> memory-barriers.txt).

Is the issue specifically related to memory versus I/O or are there
potential ordering issues for MMIO versus MMIO? I recall when working
on the dma_*mb section that the ARM barriers were much more complex
versus some of the other architectures. One big difference that I can
see for the x86 versus what you define for the "_relaxed" version of
things is the ordering of MMIO operations with respect to locked
transactions. I know x86 forces all MMIO operations to be completed
before you can process any locked operation.

> Has something changed?
>
> Will

As far as I know the code has been this way for a while, something
like 2002, when the barrier was already present in e1000. However
there it was calling out weakly ordered models "such as IA-64". Since
then pretty much all the hardware based network drivers at this point
have similar code floating around with wmb() in place to prevent
issues on weakly ordered memory systems.

So in any case we still need to be careful as there are architectures
that are depending on this even if they might not be x86. :-/

- Alex

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 18:54 ` Alexander Duyck @ 2018-03-27 19:54 ` Arnd Bergmann 2018-03-27 21:33 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 19:54 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Linus Torvalds

On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck <alexander.duyck@gmail.com> wrote:
> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote:
>>>
>>> Sinan
>>> "We are being told that if you use writel(), then you don't need a wmb() on
>>> all architectures."
>>>
>>> Alex:
>>> "I'm not sure who told you that but that is incorrect, at least for
>>> x86. If you attempt to use writel() without the wmb() we will have to
>>> NAK the patches. We will accept the wmb() with writel_relaxed() since
>>> that solves things for ARM."
>>>
>>> > Jason is seeking behavior clarification for write combined buffers.
>>>
>>> Alex:
>>> "Don't bother. I can tell you right now that for x86 you have to have a
>>> wmb() before the writel().
>>
>> To clarify: are you saying that on x86 you need a wmb() prior to a writel
>> if you want that writel to be ordered after prior writes to memory? Is this
>> specific to WC memory or some other non-standard attribute?
>
> Note, I am not a CPU guy so this is just my interpretation. It is my
> understanding that the wmb(), aka sfence, is needed on x86 to sort out
> writes between Write-back(WB) system memory and Strong Uncacheable
> (UC) MMIO accesses.
>
> I was hoping to be able to cite something in the software developers
> manual (https://software.intel.com/sites/default/files/managed/39/c5/325462-sdm-vol-1-2abcd-3abcd.pdf),
> but that tends to be pretty vague. I have re-read section 22.34
> (volume 3B) several times and I am still not clear on if it says we
> need the sfence or not. It is a matter of figuring out what the impact
> of store buffers and caching are for WB versus UC memory.

Here is what I found regarding the store buffer in that document:

11.10 STORE BUFFER
Intel 64 and IA-32 processors temporarily store each write (store) to
memory in a store buffer. The store buffer improves processor performance
by allowing the processor to continue executing instructions without
having to wait until a write to memory and/or to a cache is complete. It
also allows writes to be delayed for more efficient use of memory-access
bus cycles.
In general, the existence of the store buffer is transparent to
software, even in systems that use multiple processors. The processor
ensures that write operations are always carried out in program order. It
also insures that the contents of the store buffer are always drained to
memory in the following situations:
• When an exception or interrupt is generated.
• (P6 and more recent processor families only) When a serializing
  instruction is executed.
• When an I/O instruction is executed.
• When a LOCK operation is performed.
• (P6 and more recent processor families only) When a BINIT operation
  is performed.
• (Pentium III, and more recent processor families only) When using an
  SFENCE instruction to order stores.
• (Pentium 4 and more recent processor families only) When using an
  MFENCE instruction to order stores.
The discussion of write ordering in Section 8.2, “Memory Ordering,” gives
a detailed description of the operation of the store buffer.

Arnd

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 19:54 ` Arnd Bergmann @ 2018-03-27 20:46 ` Arnd Bergmann -1 siblings, 0 replies; 216+ messages in thread From: Arnd Bergmann @ 2018-03-27 20:46 UTC (permalink / raw) To: Alexander Duyck Cc: Will Deacon, Sinan Kaya, Benjamin Herrenschmidt, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Linus Torvalds On Tue, Mar 27, 2018 at 9:54 PM, Arnd Bergmann <arnd@arndb.de> wrote: > On Tue, Mar 27, 2018 at 8:54 PM, Alexander Duyck > <alexander.duyck@gmail.com> wrote: >> On Tue, Mar 27, 2018 at 8:10 AM, Will Deacon <will.deacon@arm.com> wrote: > > 11.10 STORE BUFFER > Intel 64 and IA-32 processors temporarily store each write (store) to > memory in a store buffer. The store buffer > improves processor performance by allowing the processor to continue > executing instructions without having to > wait until a write to memory and/or to a cache is complete. It also > allows writes to be delayed for more efficient use > of memory-access bus cycles. > In general, the existence of the store buffer is transparent to > software, even in systems that use multiple processors. > The processor ensures that write operations are always carried out in > program order. It also insures that the > contents of the store buffer are always drained to memory in the > following situations: > • When an exception or interrupt is generated. > • (P6 and more recent processor families only) When a serializing > instruction is executed. > • When an I/O instruction is executed. I guess I/O instruction is still ambiguous on x86, it may just refer to 'inb'/'outb' style instructions rather than 'mov' on a device MMIO area. Here's a link to a reply from Linus that I found on this topic: http://yarchive.net/comp/linux/write_combining.html Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 18:54 ` Alexander Duyck 2018-03-27 19:54 ` Arnd Bergmann @ 2018-03-27 21:33 ` Benjamin Herrenschmidt 2018-03-28 0:39 ` Linus Torvalds 1 sibling, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 21:33 UTC (permalink / raw) To: Alexander Duyck, Will Deacon Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Linus Torvalds

On Tue, 2018-03-27 at 11:54 -0700, Alexander Duyck wrote:
> As far as I know the code has been this way for a while, something
> like 2002, when the barrier was already present in e1000. However
> there it was calling out weakly ordered models "such as IA-64". Since
> then pretty much all the hardware based network drivers at this point
> have similar code floating around with wmb() in place to prevent
> issues on weak ordered memory systems.
>
> So in any case we still need to be careful as there are architectures
> that are depending on this even if they might not be x86. :-/

Well, we need to clarify that once and for all, because as I wrote
earlier, it was decreed by Linus more than a decade ago that writel
would be fully ordered by itself vs. previous memory stores (at least
on UC memory).

This is why we added sync's to writel on powerpc and later ARM added
similar barriers to theirs.

This is also why writel_relaxed was added (though much later), since
what writel_relaxed does is to lift that specific requirement.

IE. If what you say is true and wmb() is needed on x86, then
writel_relaxed is now completely useless...

Ben.

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 21:33 ` Benjamin Herrenschmidt @ 2018-03-28 0:39 ` Linus Torvalds 2018-03-28 1:03 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 216+ messages in thread From: Linus Torvalds @ 2018-03-28 0:39 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> Well, we need to clarify that once and for all, because as I wrote
> earlier, it was decreed by Linus more than a decade ago that writel
> would be fully ordered by itself vs. previous memory stores (at least
> on UC memory).

Yes.

So "writel()" needs to be ordered with respect to other writel() uses
on the same thread. Anything else *will* break drivers. Obviously, the
drivers may then do magic to say "do write combining etc", but that
magic will be architecture-specific.

The other issue is that "writel()" needs to be ordered wrt other CPU's
doing "writel()" if those writel's are in a spinlocked region.

So it's not that "writel()" needs to be ordered wrt the spinlock
itself, but you *do* need to honor ordering if you have something like
this:

    spin_lock(&somelock);
    writel(a);
    writel(b);
    spin_unlock(&somelock);

and if two CPU's run the above code "at the same time", then the
*ONLY* acceptable sequence is abab.

You cannot, and must not, ever see "aabb" at the device, for example,
because of how the writel would basically leak out of the spinlock.

That sounds "obvious", but dammit, a lot of architectures got that
wrong, afaik.

Linus

^ permalink raw reply [flat|nested] 216+ messages in thread
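The rule in the message above can be restated as a check on the stream of writes the device observes. What follows is only a minimal sketch in plain C, assuming each critical section issues exactly the two writes "a" and "b" from the example; device_sequence_ok() and check_examples() are hypothetical names used for illustration and are not kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns 1 if the stream seen by the device is made
 * up only of complete "ab" pairs, i.e. no writel() leaked out of its
 * spinlocked region, and 0 otherwise.
 */
int device_sequence_ok(const char *seen)
{
    size_t i = 0;

    while (seen[i] != '\0') {
        /* Every pair must start with 'a' and be followed directly by 'b'. */
        if (seen[i] != 'a' || seen[i + 1] != 'b')
            return 0;
        i += 2;
    }
    return 1;
}

/* The two sequences named in the message: abab is fine, aabb is not. */
void check_examples(void)
{
    assert(device_sequence_ok("abab") == 1);
    assert(device_sequence_ok("aabb") == 0);
}
```

The check says nothing about how an architecture enforces the ordering; it only pins down which outcomes count as acceptable.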
* Re: RFC on writel and writel_relaxed 2018-03-28 0:39 ` Linus Torvalds @ 2018-03-28 1:03 ` Benjamin Herrenschmidt 2018-03-28 2:51 ` Linus Torvalds 0 siblings, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 1:03 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, 2018-03-27 at 14:39 -1000, Linus Torvalds wrote:
> On Tue, Mar 27, 2018 at 11:33 AM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > Well, we need to clarify that once and for all, because as I wrote
> > earlier, it was decreed by Linus more than a decade ago that writel
> > would be fully ordered by itself vs. previous memory stores (at least
> > on UC memory).
>
> Yes.
>
> So "writel()" needs to be ordered with respect to other writel() uses
> on the same thread. Anything else *will* break drivers. Obviously, the
> drivers may then do magic to say "do write combining etc", but that
> magic will be architecture-specific.
>
> The other issue is that "writel()" needs to be ordered wrt other CPU's
> doing "writel()" if those writel's are in a spinlocked region.

 .../...

The discussion at hand is about

	dma_buffer->foo = 1;			/* WB */
	writel(KICK, DMA_KICK_REGISTER);	/* UC */

(The WC case is something else, let's not mix things up just yet)

IE, a store to normal WB cache memory followed by a writel to a device
which will then DMA from that buffer.

Back in the days, we did require on powerpc a wmb() between these, but
you made the point that x86 didn't and driver writers would never get
that right.

We decided to go conservative, added the necessary barrier inside
writel, so did ARM and it became the norm that writel is also fully
ordered vs. previous stores to memory *by the same CPU* of course (or
protected by the same spinlock).

Now it appears that this wasn't fully understood back then, and some
people are now saying that x86 might not even provide that semantic
always.

So a number (fairly large) of drivers have been adding wmb() in those
cases, while others haven't, and it's a mess.

The mess is compounded by the fact that if writel is now defined to
*not* provide that ordering guarantee, then writel_relaxed() is
pointless since all it is defined to relax is precisely the above
ordering guarantee.

So I want to get to the bottom of this once and for all so we can have
well defined and documented semantics and stop having drivers do random
things that may or may not work on some or all architectures (including
x86 !).

Quick note about the spinlock case... In fact this is the only case you
did allow back then to be relaxed. In theory a writel followed by a
spin_unlock requires an mmiowb (which is the only point of that barrier
in fact).

This was done because an arch (I think ia64) had a hard time getting
MMIOs from multiple CPUs get in order vs. a lock and required an
expensive access to the PCI host bridge to do so.

Back then, on powerpc, we chose not to allow that relaxing and instead
added code to our writel to set a per-cpu flag which would cause the
next spin_unlock to use a stronger barrier than usual.

We do need to clarify this as well, but let's start with the most basic
one first, there is enough confusion already.

Cheers,
Ben.

^ permalink raw reply [flat|nested] 216+ messages in thread
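The powerpc approach described in the message above (writel sets a per-cpu flag, and the next spin_unlock escalates to a stronger barrier) can be modelled in userspace C. This is a sketch under stated assumptions, not the powerpc implementation: struct cpu_state, model_writel(), model_spin_unlock() and run_demo() are hypothetical stand-ins, and the counters only record which flavour of barrier would be issued. No real barrier instructions or kernel APIs are used.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-cpu bookkeeping for the model. */
struct cpu_state {
    bool io_sync;                /* set when an MMIO write happened under the lock */
    unsigned int heavy_barriers; /* unlocks that escalated to the strong barrier */
    unsigned int light_barriers; /* unlocks that used only the usual release */
};

/* Stand-in for a device register; a plain variable in this model. */
static volatile uint32_t fake_reg;

/* Model of writel(): perform the store and remember that MMIO was done. */
static void model_writel(struct cpu_state *cpu, uint32_t val,
                         volatile uint32_t *reg)
{
    *reg = val;
    cpu->io_sync = true;
}

/* Model of spin_unlock(): escalate only if MMIO happened since the last unlock. */
static void model_spin_unlock(struct cpu_state *cpu)
{
    if (cpu->io_sync) {
        cpu->heavy_barriers++;
        cpu->io_sync = false;
    } else {
        cpu->light_barriers++;
    }
}

/* Three critical sections; only the middle one touches the device. */
struct cpu_state run_demo(void)
{
    struct cpu_state cpu = { false, 0, 0 };

    model_spin_unlock(&cpu);        /* no MMIO: usual release */
    model_writel(&cpu, 1, &fake_reg);
    model_spin_unlock(&cpu);        /* MMIO seen: escalated barrier */
    model_spin_unlock(&cpu);        /* flag was cleared: usual release again */

    assert(fake_reg == 1);
    return cpu;
}
```

The design point the model illustrates is that the cost of the stronger barrier is paid only by critical sections that actually performed MMIO, and drivers do not need to add anything themselves.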
* Re: RFC on writel and writel_relaxed 2018-03-28 1:03 ` Benjamin Herrenschmidt @ 2018-03-28 2:51 ` Linus Torvalds 2018-03-28 3:24 ` Sinan Kaya 2018-03-28 4:33 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 216+ messages in thread From: Linus Torvalds @ 2018-03-28 2:51 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> The discussion at hand is about
>
>         dma_buffer->foo = 1;                    /* WB */
>         writel(KICK, DMA_KICK_REGISTER);        /* UC */

Yes. That certainly is ordered on x86. In fact, afaik it's ordered
even if that writel() might be of type WC, because that only delays
writes, it doesn't move them earlier.

Whether people *do* that or not, I don't know. But I wouldn't be
surprised if they do.

So if it's a DMA buffer, it's "cached". And even cached accesses are
ordered wrt MMIO.

Basically, to get unordered writes on x86, you need to either use
explicitly nontemporal stores, or have a writecombining region with
back-to-back writes that actually combine.

And nobody really does that nontemporal store thing any more because
the hardware that cared pretty much doesn't exist any more. It was too
much pain. People use DMA and maybe an UC store for starting the DMA
(or possibly a WC buffer that gets multiple stores in ascending order
as a stream of commands).

Things like UC will force everything to be entirely ordered, but even
without UC, loads won't pass loads, and stores won't pass stores.

> Now it appears that this wasn't fully understood back then, and some
> people are now saying that x86 might not even provide that semantic
> always.

Oh, the above UC case is absolutely guaranteed.

And I think even if it's WC, the write to kick off the DMA is ordered
wrt the cached write.

On x86, I think you need barriers only if you do things like

 - do two non-temporal stores and require them to be ordered: put a
   sfence or mfence in between them.

 - do two WC stores, and make sure they do not combine: put a sfence or
   mfence between them.

 - do a store, and a subsequent load from a different address, and
   neither of them is UC: put a mfence between them. But note that this
   is literally just "load after store". A "store after load" doesn't
   need one.

I think that's pretty much it.

For example, the "lfence" instruction is almost entirely pointless on
x86 - it was designed back in the time when people *thought* they
might re-order loads. But loads don't get re-ordered. At least Intel
seems to document that only non-temporal *stores* can get re-ordered
wrt each other.

End result: lfence is a historical oddity that can now be used to
guarantee that a previous load has finished, and that in turn meant
that it is now used in the Spectre mitigations. But it basically has
no real memory ordering meaning since nothing passes an earlier load
anyway, it's more of a pipeline thing.

But in the end, one question is just "how much do drivers actually
_rely_ on the x86 strong ordering?"

We do support "smp_wmb()" even though x86 has strong enough ordering
that just a barrier suffices. Somebody might just say "screw the x86
memory ordering, we're relaxed, and we'll fix up the drivers we care
about".

The only issue really is that 99.9% of all testing gets done on x86
unless you look at specific SoC drivers.

On ARM, for example, there is likely little reason to care about x86
memory ordering, because there is almost zero driver overlap between
x86 and ARM.

*Historically*, the reason for following the x86 IO ordering was
simply that a lot of architectures used the drivers that were
developed on x86. The alpha and powerpc workstations were *designed*
with the x86 IO bus (PCI, then PCIe) and to work with the devices that
came with it.

ARM? PCIe is almost irrelevant. For ARM servers, if they ever take
off, sure. But 99.99% of ARM is about their own SoC's, and so "x86
test coverage" is simply not an issue.

How much of an issue is it for Power? Maybe you decide it's not a big deal.

Then all the above is almost irrelevant.

Linus

^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 2:51 ` Linus Torvalds @ 2018-03-28 3:24 ` Sinan Kaya 2018-03-28 4:41 ` Benjamin Herrenschmidt 2018-03-28 6:14 ` Linus Torvalds 2018-03-28 4:33 ` Benjamin Herrenschmidt 1 sibling, 2 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-28 3:24 UTC (permalink / raw) To: Linus Torvalds, Benjamin Herrenschmidt Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev

On 3/27/2018 10:51 PM, Linus Torvalds wrote:
>> The discussion at hand is about
>>
>>         dma_buffer->foo = 1;                    /* WB */
>>         writel(KICK, DMA_KICK_REGISTER);        /* UC */
> Yes. That certainly is ordered on x86. In fact, afaik it's ordered
> even if that writel() might be of type WC, because that only delays
> writes, it doesn't move them earlier.

Now that we clarified x86 myth, Is this guaranteed on all architectures?

We keep getting IA64 exception example. Maybe, this is also corrected since
then.

Jose Abreu says "I don't know about x86 but arc architecture doesn't
have a wmb() in the writel() function (in some configs)".

As long as we have these exceptions, these wmb() in common drivers is not
going anywhere and relaxed-arches will continue paying performance penalty.

I see 15% performance loss on ARM64 servers using Intel i40e network
drivers and an XL710 adapter due to CPU keeping itself busy doing barriers
most of the time rather than real work because of sequences like this all over
the place.

        dma_buffer->foo = 1;                    /* WB */
        wmb()
        writel(KICK, DMA_KICK_REGISTER);        /* UC */

I posted several patches last week to remove duplicate barriers on ARM while
trying to make the code friendly with other architectures.

Basically changing it to

        dma_buffer->foo = 1;                    /* WB */
        wmb()
        writel_relaxed(KICK, DMA_KICK_REGISTER);        /* UC */
        mmiowb()

This is a small step in the performance direction until we remove all exceptions.

https://www.spinics.net/lists/netdev/msg491842.html
https://www.spinics.net/lists/linux-rdma/msg62434.html
https://www.spinics.net/lists/arm-kernel/msg642336.html

Discussion started to move around the need for relaxed API on PPC and then
why wmb() question came up.

Sinan

--
Sinan Kaya
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc.
Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply [flat|nested] 216+ messages in thread
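The duplicate-barrier cost discussed in the message above can be made concrete with a small userspace model. This is a hedged sketch, assuming (as the thread says of ARM and powerpc) that writel() carries its own write barrier while writel_relaxed() does not. Every name here (model_wmb(), model_writel(), model_writel_relaxed(), the kick_* functions, dma_kick_register) is a hypothetical stand-in; the C11 fence is only a placeholder for the real architecture barrier, and a plain variable plays the role of the device register.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define KICK 1u

/* Hypothetical stand-ins for a DMA descriptor and a device doorbell. */
struct dma_desc { uint32_t foo; };
static struct dma_desc dma_buffer;
static volatile uint32_t dma_kick_register;
static unsigned int barriers_issued;

/* Placeholder barrier: a release fence plus a counter for the model. */
static void model_wmb(void)
{
    atomic_thread_fence(memory_order_release);
    barriers_issued++;
}

/* Relaxed accessor: just the store, no ordering against earlier writes. */
static void model_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
    *reg = val;
}

/* Ordered accessor: barrier first, then the store. */
static void model_writel(uint32_t val, volatile uint32_t *reg)
{
    model_wmb();
    model_writel_relaxed(val, reg);
}

/* Buffer write followed by writel(): one barrier, inside the accessor. */
unsigned int kick_with_writel(void)
{
    barriers_issued = 0;
    dma_buffer.foo = 1;
    model_writel(KICK, &dma_kick_register);
    assert(dma_kick_register == KICK);
    return barriers_issued;
}

/* Explicit wmb() plus writel(): the barrier is paid twice. */
unsigned int kick_with_redundant_wmb(void)
{
    barriers_issued = 0;
    dma_buffer.foo = 1;
    model_wmb();
    model_writel(KICK, &dma_kick_register);
    assert(dma_kick_register == KICK);
    return barriers_issued;
}

/* Explicit wmb() plus writel_relaxed(): one barrier, supplied by the driver. */
unsigned int kick_with_relaxed(void)
{
    barriers_issued = 0;
    dma_buffer.foo = 1;
    model_wmb();
    model_writel_relaxed(KICK, &dma_kick_register);
    assert(dma_kick_register == KICK);
    return barriers_issued;
}
```

In this model the first and third variants issue the same number of barriers, which is why the choice between them comes down to whether writel() can be trusted to order against prior memory writes on every architecture.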
* Re: RFC on writel and writel_relaxed 2018-03-28 3:24 ` Sinan Kaya @ 2018-03-28 4:41 ` Benjamin Herrenschmidt 2018-03-28 6:14 ` Linus Torvalds 1 sibling, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 4:41 UTC (permalink / raw) To: Sinan Kaya, Linus Torvalds Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev On Tue, 2018-03-27 at 23:24 -0400, Sinan Kaya wrote: > On 3/27/2018 10:51 PM, Linus Torvalds wrote: > > > The discussion at hand is about > > > > > > dma_buffer->foo = 1; /* WB */ > > > writel(KICK, DMA_KICK_REGISTER); /* UC */ > > > > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > > even if that writel() might be of type WC, because that only delays > > writes, it doesn't move them earlier. > > Now that we clarified x86 myth, Is this guaranteed on all architectures? If not we need to fix it. It's guaranteed on the "main" ones (arm, arm64, powerpc, i386, x86_64). We might need to check with other arch maintainers for the rest. We really want Linux to provide well defined "sane" semantics for the basic writel accessors. Note: We still have open questions about how readl() relates to surrounding memory accesses. It looks like ARM and powerpc do different things here. > We keep getting IA64 exception example. Maybe, this is also corrected since > then. I would think ia64 fixed it back when it was all discussed. I was under the impression all ia64 had "special" was the writel vs. spin_unlock which requires mmiowb, but maybe that was never completely fixed ? > Jose Abreu says "I don't know about x86 but arc architecture doesn't > have a wmb() in the writel() function (in some configs)". Well, it probably should then. > As long as we have these exceptions, these wmb() in common drivers is not > going anywhere and relaxed-arches will continue paying performance penalty. 
Well, let's fix them or leave them broken, at this point, it doesn't matter. We can give all arch maintainers a wakeup call and start making drivers work based on the documented assumptions. > I see 15% performance loss on ARM64 servers using Intel i40e network > drivers and an XL710 adapter due to CPU keeping itself busy doing barriers > most of the time rather than real work because of sequences like this all over > the place. > > dma_buffer->foo = 1; /* WB */ > wmb() > writel(KICK, DMA_KICK_REGISTER); /* UC */ > > I posted several patches last week to remove duplicate barriers on ARM while > trying to make the code friendly with other architectures. > > Basically changing it to > > dma_buffer->foo = 1; /* WB */ > wmb() > writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */ > mmiowb() > > This is a small step in the performance direction until we remove all exceptions. > > https://www.spinics.net/lists/netdev/msg491842.html > https://www.spinics.net/lists/linux-rdma/msg62434.html > https://www.spinics.net/lists/arm-kernel/msg642336.html > > Discussion started to move around the need for relaxed API on PPC and then > why wmb() question came up. I'm working on the problem of relaxed APIs for powerpc, but we can keep that orthogonal. As is, today, a wmb() + writel() and a wmb() + writel_relaxed() on powerpc are identical. So changing them will not break us. But I don't see the point of doing that transformation if we can just get the straying archs fixed. It's not like any of them has a significant market presence these days anyway. Cheers, Ben. > Sinan > ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 3:24 ` Sinan Kaya 2018-03-28 4:41 ` Benjamin Herrenschmidt @ 2018-03-28 6:14 ` Linus Torvalds 2018-03-28 11:41 ` okaya 1 sibling, 1 reply; 216+ messages in thread From: Linus Torvalds @ 2018-03-28 6:14 UTC (permalink / raw) To: Sinan Kaya Cc: Benjamin Herrenschmidt, Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya <okaya@codeaurora.org> wrote: > > Basically changing it to > > dma_buffer->foo = 1; /* WB */ > wmb() > writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */ > mmiowb() Why? Why not just remove the wmb(), and keep the barrier in the writel()? The above code makes no sense, and just looks stupid to me. It also generates pointlessly bad code on x86, so it's bad there too. Linus ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:14 ` Linus Torvalds @ 2018-03-28 11:41 ` okaya 2018-03-28 15:13 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 216+ messages in thread From: okaya @ 2018-03-28 11:41 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, linus971 On 2018-03-28 02:14, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 5:24 PM, Sinan Kaya <okaya@codeaurora.org> > wrote: >> >> Basically changing it to >> >> dma_buffer->foo = 1; /* WB */ >> wmb() >> writel_relaxed(KICK, DMA_KICK_REGISTER); /* UC */ >> mmiowb() > > Why? > > Why not just remove the wmb(), and keep the barrier in the writel()? Yes, we want to get there indeed. It is because of some arch not implementing writel properly. Maintainers want to play safe. That is why I asked if IA64 and other well known archs follow the strongly ordered rule at this moment like PPC and ARM. Or should we go and inform every arch about this before yanking wmb()? Maintainers are afraid of introducing a regression. > > The above code makes no sense, and just looks stupid to me. It also > generates pointlessly bad code on x86, so it's bad there too. > > Linus ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 11:41 ` okaya @ 2018-03-28 15:13 ` Benjamin Herrenschmidt 2018-03-28 15:55 ` David Miller 0 siblings, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 15:13 UTC (permalink / raw) To: okaya, Linus Torvalds Cc: Alexander Duyck, Will Deacon, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, linus971 On Wed, 2018-03-28 at 07:41 -0400, okaya@codeaurora.org wrote: > Yes, we want to get there indeed. It is because of some arch not > implementing writel properly. Maintainers want to play safe. > > That is why I asked if IA64 and other well known archs follow the > strongly ordered rule at this moment like PPC and ARM. > > Or should we go and inform every arch about this before yanking wmb()? > > Maintainers are afraid of introducing a regression. Let's fix all archs, it's way easier than fixing all drivers. Half of the archs are unused or dead anyway. > > > > The above code makes no sense, and just looks stupid to me. It also > > generates pointlessly bad code on x86, so it's bad there too. > > > > Linus ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 15:13 ` Benjamin Herrenschmidt @ 2018-03-28 15:55 ` David Miller 2018-03-28 16:23 ` Nicholas Piggin 2018-03-29 13:56 ` Sinan Kaya 0 siblings, 2 replies; 216+ messages in thread From: David Miller @ 2018-03-28 15:55 UTC (permalink / raw) To: benh Cc: okaya, torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight, oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck, netdev, linus971 From: Benjamin Herrenschmidt <benh@kernel.crashing.org> Date: Thu, 29 Mar 2018 02:13:16 +1100 > Let's fix all archs, it's way easier than fixing all drivers. Half of > the archs are unused or dead anyway. Agreed. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 15:55 ` David Miller @ 2018-03-28 16:23 ` Nicholas Piggin 2018-03-28 21:31 ` Benjamin Herrenschmidt 2018-03-29 13:56 ` Sinan Kaya 1 sibling, 1 reply; 216+ messages in thread From: Nicholas Piggin @ 2018-03-28 16:23 UTC (permalink / raw) To: David Miller Cc: benh, paulmck, arnd, linux-rdma, linuxppc-dev, linus971, will.deacon, alexander.duyck, okaya, jgg, David.Laight, oohall, netdev, alexander.h.duyck, torvalds On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) David Miller <davem@davemloft.net> wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > the archs are unused or dead anyway. > > Agreed. While we're making decrees here, can we do something about mmiowb? The semantics are basically indecipherable. This is a variation on the mandatory write barrier that causes writes to weakly ordered I/O regions to be partially ordered. Its effects may go beyond the CPU->Hardware interface and actually affect the hardware at some level. How can a driver writer possibly get that right? IIRC it was added for some big ia64 system that was really expensive to implement the proper wmb() semantics on. So wmb() semantics were quietly downgraded, then the subsequently broken drivers they cared about were fixed by adding the stronger mmiowb(). What should have happened was wmb and writel remained correct, sane, and expensive, and they add an mmio_wmb() to order MMIO stores made by the writel_relaxed accessors, then use that to speed up the few drivers they care about. Now that ia64 doesn't matter too much, can we deprecate mmiowb and just make wmb ordering talk about stores to the device, not to some intermediate stage of the interconnect where it can be subsequently reordered wrt the device? Drivers can be converted back to using wmb or writel gradually. 
Thanks, Nick ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 16:23 ` Nicholas Piggin @ 2018-03-28 21:31 ` Benjamin Herrenschmidt 2018-03-28 22:09 ` Nicholas Piggin 2018-03-29 9:20 ` Will Deacon 0 siblings, 2 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 21:31 UTC (permalink / raw) To: Nicholas Piggin, David Miller Cc: paulmck, arnd, linux-rdma, linuxppc-dev, linus971, will.deacon, alexander.duyck, okaya, jgg, David.Laight, oohall, netdev, alexander.h.duyck, torvalds On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > David Miller <davem@davemloft.net> wrote: > > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > While we're making decrees here, can we do something about mmiowb? > The semantics are basically indecipherable. I was going to tackle that next :-) > This is a variation on the mandatory write barrier that causes writes to weakly > ordered I/O regions to be partially ordered. Its effects may go beyond the > CPU->Hardware interface and actually affect the hardware at some level. > > How can a driver writer possibly get that right? > > IIRC it was added for some big ia64 system that was really expensive > to implement the proper wmb() semantics on. So wmb() semantics were > quietly downgraded, then the subsequently broken drivers they cared > about were fixed by adding the stronger mmiowb(). > > What should have happened was wmb and writel remained correct, sane, and > expensive, and they add an mmio_wmb() to order MMIO stores made by the > writel_relaxed accessors, then use that to speed up the few drivers they > care about. 
> > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > make wmb ordering talk about stores to the device, not to some > intermediate stage of the interconnect where it can be subsequently > reordered wrt the device? Drivers can be converted back to using wmb > or writel gradually. I was under the impression that mmiowb was specifically about ordering writel's with a subsequent spin_unlock, without it, MMIOs from different CPUs (within the same lock) would still arrive OO. If that's indeed the case, I would suggest ia64 switches to a similar per-cpu flag trick powerpc uses. Cheers, Ben. > Thanks, > Nick ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 21:31 ` Benjamin Herrenschmidt @ 2018-03-28 22:09 ` Nicholas Piggin 2018-03-29 9:20 ` Will Deacon 1 sibling, 0 replies; 216+ messages in thread From: Nicholas Piggin @ 2018-03-28 22:09 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: David Miller, paulmck, arnd, linux-rdma, linuxppc-dev, linus971, will.deacon, alexander.duyck, okaya, jgg, David.Laight, oohall, netdev, alexander.h.duyck, torvalds On Thu, 29 Mar 2018 08:31:32 +1100 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > On Wed, 28 Mar 2018 11:55:09 -0400 (EDT) > > David Miller <davem@davemloft.net> wrote: > > > > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > > the archs are unused or dead anyway. > > > > > > Agreed. > > > > While we're making decrees here, can we do something about mmiowb? > > The semantics are basically indecipherable. > > I was going to tackle that next :-) > > > This is a variation on the mandatory write barrier that causes writes to weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. 
> > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock, without it, MMIOs from > different CPUs (within the same lock) would still arrive OO. Yes more or less, and I think that until mmiowb was introduced, wmb or writel was sufficient for this. Thanks, Nick ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 21:31 ` Benjamin Herrenschmidt 2018-03-28 22:09 ` Nicholas Piggin @ 2018-03-29 9:20 ` Will Deacon 1 sibling, 0 replies; 216+ messages in thread From: Will Deacon @ 2018-03-29 9:20 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Nicholas Piggin, David Miller, paulmck, arnd, linux-rdma, linuxppc-dev, linus971, alexander.duyck, okaya, jgg, David.Laight, oohall, netdev, alexander.h.duyck, torvalds On Thu, Mar 29, 2018 at 08:31:32AM +1100, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 02:23 +1000, Nicholas Piggin wrote: > > This is a variation on the mandatory write barrier that causes writes to weakly > > ordered I/O regions to be partially ordered. Its effects may go beyond the > > CPU->Hardware interface and actually affect the hardware at some level. > > > > How can a driver writer possibly get that right? > > > > IIRC it was added for some big ia64 system that was really expensive > > to implement the proper wmb() semantics on. So wmb() semantics were > > quietly downgraded, then the subsequently broken drivers they cared > > about were fixed by adding the stronger mmiowb(). > > > > What should have happened was wmb and writel remained correct, sane, and > > expensive, and they add an mmio_wmb() to order MMIO stores made by the > > writel_relaxed accessors, then use that to speed up the few drivers they > > care about. > > > > Now that ia64 doesn't matter too much, can we deprecate mmiowb and just > > make wmb ordering talk about stores to the device, not to some > > intermediate stage of the interconnect where it can be subsequently > > reordered wrt the device? Drivers can be converted back to using wmb > > or writel gradually. > > I was under the impression that mmiowb was specifically about ordering > writel's with a subsequent spin_unlock, without it, MMIOs from > different CPUs (within the same lock) would still arrive OO. 
> > If that's indeed the case, I would suggest ia64 switches to a similar > per-cpu flag trick powerpc uses. ... or we could remove ia64. /me runs for cover Will ^ permalink raw reply [flat|nested] 216+ messages in thread
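The "per-cpu flag trick" mentioned above is not spelled out in the thread. A minimal single-CPU sketch of the general idea, assuming it works by having the MMIO accessor set a flag that the unlock path tests, might look like this. All names (`io_sync_pending`, `model_writel`, `model_spin_unlock`, `heavy_barriers`) are hypothetical, and the counter only marks where a real barrier would be issued.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical single-CPU model; in a kernel this flag would be per-CPU. */
static bool io_sync_pending;
static unsigned int heavy_barriers;   /* counts expensive MMIO-vs-unlock barriers */
static volatile uint32_t fake_reg;
static bool lock_held;

static void model_writel(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
	io_sync_pending = true;   /* remember that this CPU did MMIO */
}

static void model_spin_lock(void)
{
	assert(!lock_held);
	lock_held = true;
}

static void model_spin_unlock(void)
{
	assert(lock_held);
	if (io_sync_pending) {
		heavy_barriers++;         /* the real barrier would be issued here */
		io_sync_pending = false;
	}
	lock_held = false;
}

static unsigned int demo_unlock_barriers(void)
{
	model_spin_lock();
	model_writel(1, &fake_reg);
	model_writel(2, &fake_reg);
	model_spin_unlock();          /* one barrier covers both writes */

	model_spin_lock();
	model_spin_unlock();          /* no MMIO under the lock: no barrier */

	return heavy_barriers;
}
```

The point of the design is that critical sections which never touch MMIO pay nothing, and drivers do not need an explicit mmiowb() before unlocking.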
* Re: RFC on writel and writel_relaxed 2018-03-28 15:55 ` David Miller 2018-03-28 16:23 ` Nicholas Piggin @ 2018-03-29 13:56 ` Sinan Kaya 2018-03-29 14:04 ` David Miller ` (2 more replies) 1 sibling, 3 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-29 13:56 UTC (permalink / raw) To: David Miller, benh Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight, oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck, netdev, linus971 On 3/28/2018 11:55 AM, David Miller wrote: > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > Date: Thu, 29 Mar 2018 02:13:16 +1100 > >> Let's fix all archs, it's way easier than fixing all drivers. Half of >> the archs are unused or dead anyway. > > Agreed. > I pinged most of the maintainers yesterday. Which arches do we care about these days? I have not been paying attention to any other architecture besides arm64.

arch        status         detail
------      -------------  ------------------------------------
alpha       question sent
arc         question sent  ysato@users.sourceforge.jp will fix it.
arm         no issues
arm64       no issues
blackfin    question sent  about to be removed
c6x         question sent
cris        question sent
frv
h8300       question sent
hexagon     question sent
ia64        no issues      confirmed by Tony Luck
m32r
m68k        question sent
metag
microblaze  question sent
mips        question sent
mn10300     question sent
nios2       question sent
openrisc    no issues      shorne@gmail.com says should no issues
parisc      no issues      grantgrundler@gmail.com says most probably no problem but still looking
powerpc     no issues
riscv       question sent
s390        question sent
score       question sent
sh          question sent
sparc       question sent
tile        question sent
unicore32   question sent
x86         no issues
xtensa      question sent

-- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-29 13:56 ` Sinan Kaya @ 2018-03-29 14:04 ` David Miller 2018-03-29 16:29 ` Arnd Bergmann 2018-03-30 1:40 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 216+ messages in thread From: David Miller @ 2018-03-29 14:04 UTC (permalink / raw) To: okaya Cc: benh, torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight, oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck, netdev, linus971 From: Sinan Kaya <okaya@codeaurora.org> Date: Thu, 29 Mar 2018 09:56:01 -0400 > sparc question sent Sparc never lets physical memory accesses pass MMIO, and vice versa. They are always strongly ordered amongst each other. Therefore no explicit barrier instructions are necessary. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-29 13:56 ` Sinan Kaya 2018-03-29 14:04 ` David Miller @ 2018-03-29 16:29 ` Arnd Bergmann 2018-03-29 16:59 ` Sinan Kaya 2018-03-30 1:40 ` Benjamin Herrenschmidt 2 siblings, 1 reply; 216+ messages in thread From: Arnd Bergmann @ 2018-03-29 16:29 UTC (permalink / raw) To: Sinan Kaya Cc: David Miller, Benjamin Herrenschmidt, Linus Torvalds, Alexander Duyck, Will Deacon, Jason Gunthorpe, David Laight, Oliver O'Halloran, linuxppc-dev, linux-rdma, Alexander Duyck, Paul E. McKenney, Networking, Linus Torvalds On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya <okaya@codeaurora.org> wrote: > On 3/28/2018 11:55 AM, David Miller wrote: >> From: Benjamin Herrenschmidt <benh@kernel.crashing.org> >> Date: Thu, 29 Mar 2018 02:13:16 +1100 >> >>> Let's fix all archs, it's way easier than fixing all drivers. Half of >>> the archs are unused or dead anyway. >> >> Agreed. >> > > I pinged most of the maintainers yesterday. > Which arches do we care about these days? > I have not been paying attention any other architecture besides arm64. > > arch status detail > ------ ------------- ------------------------------------ > alpha question sent I'm guessing alpha has problems

    extern inline u32 readl(const volatile void __iomem *addr)
    {
            u32 ret = __raw_readl(addr);
            mb();
            return ret;
    }
    extern inline void writel(u32 b, volatile void __iomem *addr)
    {
            __raw_writel(b, addr);
            mb();
    }

There is a barrier in writel /after/ the access but not before. > arc question sent ysato@users.sourceforge.jp will fix it. > arm no issues > arm64 no issues > blackfin question sent about to be removed > c6x question sent no PCI, so it might not matter that much -- all drivers are platform specific in the end. 
> cris question sent > frv cris and frv are getting removed > h8300 question sent no PCI > hexagon question sent no PCI > ia64 no issues confirmed by Tony Luck > m32r removed > m68k question sent > metag removed > microblaze question sent > mips question sent I'm guessing that some mips platforms have problems, but others don't. > mn10300 question sent removed > nios2 question sent no PCI > openrisc no issues shorne@gmail.com says should no issues > parisc no issues grantgrundler@gmail.com says most probably no problem but still looking > powerpc no issues > riscv question sent riscv should be fine > s390 question sent Pretty sure this is also fine > score question sent removed > sh question sent > sparc question sent > tile question sent removed > unicore32 question sent Note the maintainer's new email address in linux-next. > x86 no issues > xtensa question sent removed. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-29 16:29 ` Arnd Bergmann @ 2018-03-29 16:59 ` Sinan Kaya 0 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-29 16:59 UTC (permalink / raw) To: Arnd Bergmann Cc: David Miller, Benjamin Herrenschmidt, Linus Torvalds, Alexander Duyck, Will Deacon, Jason Gunthorpe, David Laight, Oliver O'Halloran, linuxppc-dev, linux-rdma, Alexander Duyck, Paul E. McKenney, Networking, Linus Torvalds On 3/29/2018 12:29 PM, Arnd Bergmann wrote: > On Thu, Mar 29, 2018 at 3:56 PM, Sinan Kaya <okaya@codeaurora.org> wrote: >> On 3/28/2018 11:55 AM, David Miller wrote: >>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org> >>> Date: Thu, 29 Mar 2018 02:13:16 +1100 >>> >>>> Let's fix all archs, it's way easier than fixing all drivers. Half of >>>> the archs are unused or dead anyway. >>> >>> Agreed. >>> >> >> I pinged most of the maintainers yesterday. >> Which arches do we care about these days? >> I have not been paying attention any other architecture besides arm64. >> >> arch status detail >> ------ ------------- ------------------------------------ >> alpha question sent Thanks for the detailed analysis. > > I'm guessing alpha has problems > > extern inline u32 readl(const volatile void __iomem *addr) > { > u32 ret = __raw_readl(addr); > mb(); > return ret; > } > extern inline void writel(u32 b, volatile void __iomem *addr) > { > __raw_writel(b, addr); > mb(); > } Looks like a problem to me too. I'll start a thread with the alpha people and CC you. > > There is a barrier in writel /after/ the access but not before. > This is the consolidated list. I also heard back from m68k and corrected contacts for arc and h8300. 
arch        status         detail
------      -------------  ------------------------------------
alpha       question sent  Arnd: alpha has problems
arc         question sent  Vineet.Gupta1@synopsys.com says he'll get to this in the next few days
arm         no issues
arm64       no issues
c6x         no issues      no PCI
h8300       no issues      no PCI: ysato@users.sourceforge.jp will fix it.
hexagon     no issues      no PCI
ia64        no issues      confirmed by Tony Luck
m68k        no issues      geert@linux-m68k.org says no problem
metag       no issues      arnd: removed
microblaze  question sent  arnd: some mips platforms have problems
mips        question sent  arnd: some mips platforms have problems
nds32       question sent
nios2       no issues      no PCI
openrisc    no issues      shorne@gmail.com says should no issues
parisc      no issues      grantgrundler@gmail.com says most probably no problem but still looking
powerpc     no issues
riscv       no issues      arnd: riscv should be fine
s390        no issues      arnd: Pretty sure this is also fine
sh          question sent
sparc       no issues      davem@davemloft.net says always strongly ordered
unicore32   question sent  resent to gxt@pku.edu.cn
x86         no issues
x86_64      no issues

> > Arnd > -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
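The barrier-placement problem in the alpha writel() quoted above can be made visible with a small trace model. This is a sketch under stated assumptions, not alpha code: `model_mb`, `model_raw_writel` and `trace_of` are hypothetical helpers that only record the order of events, with 'M' for an earlier store to memory, 'B' for a barrier and 'W' for the MMIO write.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static char trace[8];
static size_t trace_len;
static volatile uint32_t fake_reg;

static void record(char event)
{
	assert(trace_len < sizeof(trace) - 1);   /* keep the NUL terminator */
	trace[trace_len++] = event;
}

static void model_mb(void)
{
	record('B');
}

static void model_raw_writel(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
	record('W');
}

/* Shape of the quoted writel(): barrier only after the access. */
static void writel_barrier_after(uint32_t val, volatile uint32_t *addr)
{
	model_raw_writel(val, addr);
	model_mb();
}

/* Shape that orders the access against earlier stores: barrier first. */
static void writel_barrier_before(uint32_t val, volatile uint32_t *addr)
{
	model_mb();
	model_raw_writel(val, addr);
}

typedef void (*writel_fn)(uint32_t, volatile uint32_t *);

/* Record one earlier memory store followed by one MMIO write. */
static const char *trace_of(writel_fn fn)
{
	memset(trace, 0, sizeof(trace));
	trace_len = 0;
	record('M');
	fn(1, &fake_reg);
	return trace;
}
```

With the barrier only after the access the trace reads M, W, B, so nothing separates the memory store from the MMIO write. With the barrier first it reads M, B, W, which is the ordering drivers are being told they can rely on.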
* Re: RFC on writel and writel_relaxed 2018-03-29 13:56 ` Sinan Kaya 2018-03-29 14:04 ` David Miller 2018-03-29 16:29 ` Arnd Bergmann @ 2018-03-30 1:40 ` Benjamin Herrenschmidt 2018-04-02 13:01 ` Sinan Kaya 2 siblings, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-30 1:40 UTC (permalink / raw) To: Sinan Kaya, David Miller Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight, oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck, netdev, linus971 On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote: > On 3/28/2018 11:55 AM, David Miller wrote: > > From: Benjamin Herrenschmidt <benh@kernel.crashing.org> > > Date: Thu, 29 Mar 2018 02:13:16 +1100 > > > > > Let's fix all archs, it's way easier than fixing all drivers. Half of > > > the archs are unused or dead anyway. > > > > Agreed. > > > > I pinged most of the maintainers yesterday. > Which arches do we care about these days? > I have not been paying attention any other architecture besides arm64. Thanks for going through that exercise ! Once sparc, s390, microblaze and mips reply, I think we'll have a good coverage, maybe riscv is to put in that lot too. Cheers, Ben. > > arch status detail > ------ ------------- ------------------------------------ > alpha question sent > arc question sent ysato@users.sourceforge.jp will fix it. 
> arm no issues > arm64 no issues > blackfin question sent about to be removed > c6x question sent > cris question sent > frv > h8300 question sent > hexagon question sent > ia64 no issues confirmed by Tony Luck > m32r > m68k question sent > metag > microblaze question sent > mips question sent > mn10300 question sent > nios2 question sent > openrisc no issues shorne@gmail.com says should no issues > parisc no issues grantgrundler@gmail.com says most probably no problem but still looking > powerpc no issues > riscv question sent > s390 question sent > score question sent > sh question sent > sparc question sent > tile question sent > unicore32 question sent > x86 no issues > xtensa question sent > > ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-30 1:40 ` Benjamin Herrenschmidt @ 2018-04-02 13:01 ` Sinan Kaya 0 siblings, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-04-02 13:01 UTC (permalink / raw) To: Benjamin Herrenschmidt, David Miller Cc: torvalds, alexander.duyck, will.deacon, arnd, jgg, David.Laight, oohall, linuxppc-dev, linux-rdma, alexander.h.duyck, paulmck, netdev, linus971 On 3/29/2018 9:40 PM, Benjamin Herrenschmidt wrote: > On Thu, 2018-03-29 at 09:56 -0400, Sinan Kaya wrote: >> On 3/28/2018 11:55 AM, David Miller wrote: >>> From: Benjamin Herrenschmidt <benh@kernel.crashing.org> >>> Date: Thu, 29 Mar 2018 02:13:16 +1100 >>> >>>> Let's fix all archs, it's way easier than fixing all drivers. Half of >>>> the archs are unused or dead anyway. >>> >>> Agreed. >>> >> >> I pinged most of the maintainers yesterday. >> Which arches do we care about these days? >> I have not been paying attention any other architecture besides arm64. > > Thanks for going through that exercise ! > > Once sparc, s390, microblaze and mips reply, I think we'll have a good > coverage, maybe riscv is to put in that lot too. I posted the following two patches for supporting microblaze and unicore32. [PATCH v2 1/2] io: prevent compiler reordering on the default writeX() implementation [PATCH v2 2/2] io: prevent compiler reordering on the default readX() implementation The rest of the arches except mips and alpha seem OK. I sent a question email on Friday to mips and alpha mailing lists. I'll follow up with an actual patch today. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
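The two patch titles above only say that compiler reordering is prevented in the default accessors. The patch contents are not shown in this thread, so the following is a guess at the general shape rather than the posted code: `compiler_barrier`, `sketch_writel` and `sketch_readl` are hypothetical names, the empty asm with a "memory" clobber is the usual GCC/Clang idiom, and the endianness conversion a real writel()/readl() performs is left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * GCC/Clang idiom: emits no instruction, but stops the compiler from
 * moving memory accesses across this point. It is not a CPU barrier.
 */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Earlier stores may not be sunk below the device write by the compiler. */
static inline void sketch_writel(uint32_t value, volatile uint32_t *addr)
{
	assert(addr);
	compiler_barrier();
	*addr = value;
}

/* Later loads may not be hoisted above the device read by the compiler. */
static inline uint32_t sketch_readl(const volatile uint32_t *addr)
{
	uint32_t value;

	assert(addr);
	value = *addr;
	compiler_barrier();
	return value;
}

/* Exercise both accessors against an ordinary variable standing in for a register. */
static uint32_t roundtrip(uint32_t value)
{
	static volatile uint32_t fake_reg;

	sketch_writel(value, &fake_reg);
	return sketch_readl(&fake_reg);
}
```

A compiler barrier of this kind is only sufficient where the hardware itself keeps MMIO and memory accesses in order; a weakly ordered CPU would still need a real barrier instruction in the accessor.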
* Re: RFC on writel and writel_relaxed 2018-03-28 2:51 ` Linus Torvalds 2018-03-28 3:24 ` Sinan Kaya @ 2018-03-28 4:33 ` Benjamin Herrenschmidt 2018-03-28 6:26 ` Linus Torvalds 1 sibling, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 4:33 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev On Tue, 2018-03-27 at 16:51 -1000, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 3:03 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > > > The discussion at hand is about > > > > dma_buffer->foo = 1; /* WB */ > > writel(KICK, DMA_KICK_REGISTER); /* UC */ > > Yes. That certainly is ordered on x86. In fact, afaik it's ordered > even if that writel() might be of type WC, because that only delays > writes, it doesn't move them earlier. Ok so this is our answer ... ... snip ... (thanks for the background info !) > Oh, the above UC case is absoutely guaranteed. Good. Then.... > The only issue really is that 99.9% of all testing gets done on x86 > unless you look at specific SoC drivers. > > On ARM, for example, there is likely little reason to care about x86 > memory ordering, because there is almost zero driver overlap between > x86 and ARM. > > *Historically*, the reason for following the x86 IO ordering was > simply that a lot of architectures used the drivers that were > developed on x86. The alpha and powerpc workstations were *designed* > with the x86 IO bus (PCI, then PCIe) and to work with the devices that > came with it. > > ARM? PCIe is almost irrelevant. For ARM servers, if they ever take > off, sure. But 99.99% of ARM is about their own SoC's, and so "x86 > test coverage" is simply not an issue. > > How much of an issue is it for Power? Maybe you decide it's not a big deal. > > Then all the above is almost irrelevant. 
So the overlap may not be that NIL in practice :-) But even then that doesn't matter as ARM has been happily implementing the same semantics you describe above for years, as do we on powerpc. This is why I want (with your agreement) to define clearly, once and for all, that the Linux semantics of writel are that it is ordered with previous writes to coherent memory (*) This is already what ARM and powerpc provide and, from what you say, what x86 provides. I don't see any reason to keep that badly documented and have drivers randomly growing useless wmb()'s because they don't think it works on x86 without them ! Once that's sorted, let's tackle the problem of mmiowb vs. spin_unlock and the problem of writel_relaxed semantics but as separate issues :-) Also, can I assume the above ordering with writel() equally applies to readl() or not ? IE: dma_buf->foo = 1; readl(STUPID_DEVICE_DMA_KICK_ON_READ); Also works on x86 ? (It does on power, maybe not on ARM). Cheers, Ben. (*) From a Linux API perspective, all of this is only valid if the memory was allocated by dma_alloc_coherent(). Anything obtained by dma_map_something() might have been bounce buffered or might require extra cache flushes on some architectures, and thus needs dma_sync_for_{cpu,device} calls. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 4:33 ` Benjamin Herrenschmidt @ 2018-03-28 6:26 ` Linus Torvalds 2018-03-28 6:42 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 216+ messages in thread From: Linus Torvalds @ 2018-03-28 6:26 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > This is why, I want (with your agreement) to define clearly and once > and for all, that the Linux semantics of writel are that it is ordered > with previous writes to coherent memory (*) Honestly, I think those are the sane semantics. In fact, make it "ordered with previous writes" full stop, since it's not only ordered wrt previous writes to memory, but also previous writel's. > Also, can I assume the above ordering with writel() equally applies to > readl() or not ? > > IE: > dma_buf->foo = 1; > readl(STUPID_DEVICE_DMA_KICK_ON_READ); If that KICK_ON_READ is UC, then that's definitely the case. And honestly, status registers like that really should always be UC. But if somebody sets the area WC (which is crazy), then I think it might be at least debatable. x86 semantics does allow reads to be done before previous writes (or, put another way, writes to be buffered - the buffers are ordered so writes don't get re-ordered, but reads can happen during the buffering). But UC accesses are always done entirely ordered, and honestly, any status register that starts a DMA would not make sense any other way. Of course, you'd have to be pretty odd to want to start a DMA with a read anyway - partly exactly because it's bad for performance since reads will be synchronous and not buffered like a write). Linus ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:26 ` Linus Torvalds @ 2018-03-28 6:42 ` Benjamin Herrenschmidt 2018-03-28 6:53 ` Linus Torvalds 2018-03-28 9:07 ` Will Deacon 0 siblings, 2 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 6:42 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote: > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > > > This is why, I want (with your agreement) to define clearly and once > > and for all, that the Linux semantics of writel are that it is ordered > > with previous writes to coherent memory (*) > > Honestly, I think those are the sane semantics. In fact, make it > "ordered with previous writes" full stop, since it's not only ordered > wrt previous writes to memory, but also previous writel's. Of course. It was somewhat a given that it's ordered vs. any previous MMIO actually, but it doesn't hurt to spell it out once more. > > Also, can I assume the above ordering with writel() equally applies to > > readl() or not ? > > > > IE: > > dma_buf->foo = 1; > > readl(STUPID_DEVICE_DMA_KICK_ON_READ); > > If that KICK_ON_READ is UC, then that's definitely the case. And > honestly, status registers like that really should always be UC. > > But if somebody sets the area WC (which is crazy), then I think it > might be at least debatable. x86 semantics does allow reads to be done > before previous writes (or, put another way, writes to be buffered - > the buffers are ordered so writes don't get re-ordered, but reads can > happen during the buffering). Right, for now I worry about UC semantics. Once we have nailed that, we can look at WC, which is a lot more tricky as archs differ more widely, but one thing at a time. 
> But UC accesses are always done entirely ordered, and honestly, any > status register that starts a DMA would not make sense any other way. > > Of course, you'd have to be pretty odd to want to start a DMA with a > read anyway - partly exactly because it's bad for performance since > reads will be synchronous and not buffered like a write). I have bad memories of old adaptec controllers ... That said, I think the above might not be right on ARM if we want to make it the rule, Will, what do you reckon ? Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:42 ` Benjamin Herrenschmidt 2018-03-28 6:53 ` Linus Torvalds @ 2018-03-28 6:53 ` Linus Torvalds 1 sibling, 0 replies; 216+ messages in thread From: Linus Torvalds @ 2018-03-28 6:53 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Arnd Bergmann, linux-rdma, netdev, Will Deacon, Alexander Duyck, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, Paul E. McKenney, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT) On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > Of course, you'd have to be pretty odd to want to start a DMA with a > > read anyway - partly exactly because it's bad for performance since > > reads will be synchronous and not buffered like a write). > > I have bad memories of old adaptec controllers ... > *Old* adaptec controllers were likely to use the in/out instructions for status and command data. Those are actually even more ordered than UC reads and writes: the in/out instructions are not just fully ordered, but are fully *synchronous* on x86. So not just doing accesses in order, but actually waiting for everything to drain before they start executing, but they also wait for the operation itself to complete (ie "out" will not just queue the write, it will then wait for the queue to empty and the write data to hit the line). That's why in/out were *so* slow, and why nobody uses them any more (well, the address size limitations and the lack of any remapping of the address obviously also are a reason). Linus ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:53 ` Linus Torvalds (?) (?) @ 2018-03-28 6:56 ` Benjamin Herrenschmidt 2018-03-28 7:11 ` Arnd Bergmann -1 siblings, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 6:56 UTC (permalink / raw) To: Linus Torvalds Cc: Alexander Duyck, Will Deacon, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote: > > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > > > > Of course, you'd have to be pretty odd to want to start a DMA with a > > > read anyway - partly exactly because it's bad for performance since > > > reads will be synchronous and not buffered like a write). > > > > I have bad memories of old adaptec controllers ... > > *Old* adaptec controllers were likely to use the in/out instructions > for status and command data. > > Those are actually even more ordered than UC reads and writes: the > in/out instructions are not just fully ordered, but are fully > *synchronous* on x86. > > So not just doing accesses in order, but actually waiting for > everything to drain before they start executing, but they also wait > for the operation itself to complete (ie "out" will not just queue > the write, it will then wait for the queue to empty and the write > data to hit the line). > > That's why in/out were *so* slow, and why nobody uses them any more > (well, the address size limitations and the lack of any remapping of > the address obviously also are a reason). All true indeed, though a lot of other archs never quite made them fully synchronous, which was another can of worms ... oh well. As for Adaptec, you might be right, I do remember having cases of old stuff triggering DMA on reads, it might have been "Mac" variants of Adaptec using MMIO or something... Cheers, Ben. 
^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:56 ` Benjamin Herrenschmidt @ 2018-03-28 7:11 ` Arnd Bergmann 2018-03-28 7:42 ` Benjamin Herrenschmidt 0 siblings, 1 reply; 216+ messages in thread From: Arnd Bergmann @ 2018-03-28 7:11 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Linus Torvalds, Alexander Duyck, Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote: >> On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: >> That's why in/out were *so* slow, and why nobody uses them any more >> (well, the address size limitations and the lack of any remapping of >> the address obviously also are a reason). > > All true indeed, though a lot of other archs never quite made them > fully synchronous, which was another can of worms ... oh well. Many architectures have no way of providing PCI compliant semantics for outb, as their instruction set and/or bus interconnect lacks a method of waiting for completion of an outb. In practice, it doesn't seem to matter for any of the devices one would encounter these days: very few use I/O space, and those that do don't actually rely on the strict ordering. Some architectures (in particular s390, but I remember seeing the same thing elsewhere) explicitly disallow I/O space access on PCI because of this. On ARM, the typical PCI implementations have other problems that are worse than this one, so most drivers are fine with the almost-working semantics. Arnd ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 7:11 ` Arnd Bergmann @ 2018-03-28 7:42 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 7:42 UTC (permalink / raw) To: Arnd Bergmann Cc: Linus Torvalds, Alexander Duyck, Will Deacon, Sinan Kaya, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, 2018-03-28 at 09:11 +0200, Arnd Bergmann wrote: > On Wed, Mar 28, 2018 at 8:56 AM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > On Wed, 2018-03-28 at 06:53 +0000, Linus Torvalds wrote: > > > On Tue, Mar 27, 2018, 20:43 Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > > > That's why in/out were *so* slow, and why nobody uses them any more > > > (well, the address size limitations and the lack of any remapping of > > > the address obviously also are a reason). > > > > All true indeed, though a lot of other archs never quite made them > > fully synchronous, which was another can of worms ... oh well. > > Many architectures have no way of providing PCI compliant semantics > for outb, as their instruction set and/or bus interconnect lacks a > method of waiting for completion of an outb. Yup, that includes powerpc. Note that since POWER8 we don't even generate IO space anymore :-) > In practice, it doesn't seem to matter for any of the devices one would > encounter these days: very few use I/O space, and those that do don't > actually rely on the strict ordering. Some architectures (in particular > s390, but I remember seeing the same thing elsewhere) explicitly > disallow I/O space access on PCI because of this. On ARM, the typical > PCI implementations have other problems that are worse than this > one, so most drivers are fine with the almost-working semantics. /me cries... Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 6:42 ` Benjamin Herrenschmidt 2018-03-28 6:53 ` Linus Torvalds @ 2018-03-28 9:07 ` Will Deacon 2018-03-28 9:56 ` Benjamin Herrenschmidt 1 sibling, 1 reply; 216+ messages in thread From: Will Deacon @ 2018-03-28 9:07 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, Mar 28, 2018 at 05:42:56PM +1100, Benjamin Herrenschmidt wrote: > On Tue, 2018-03-27 at 20:26 -1000, Linus Torvalds wrote: > > On Tue, Mar 27, 2018 at 6:33 PM, Benjamin Herrenschmidt > > <benh@kernel.crashing.org> wrote: > > > > > > This is why, I want (with your agreement) to define clearly and once > > > and for all, that the Linux semantics of writel are that it is ordered > > > with previous writes to coherent memory (*) > > > > Honestly, I think those are the sane semantics. In fact, make it > > "ordered with previous writes" full stop, since it's not only ordered > > wrt previous writes to memory, but also previous writel's. > > Of course. It was somewhat a given that it's ordered vs. any previous > MMIO actually, but it doesn't hurt to spell it out once more. Good. So I think this confirms our understanding so far. > > > > Also, can I assume the above ordering with writel() equally applies to > > > readl() or not ? > > > > > > IE: > > > dma_buf->foo = 1; > > > readl(STUPID_DEVICE_DMA_KICK_ON_READ); > > > > If that KICK_ON_READ is UC, then that's definitely the case. And > > honestly, status registers like that really should always be UC. > > > > But if somebody sets the area WC (which is crazy), then I think it > > might be at least debatable. 
x86 semantics does allow reads to be done > > before previous writes (or, put another way, writes to be buffered - > > the buffers are ordered so writes don't get re-ordered, but reads can > > happen during the buffering). > > Right, for now I worry about UC semantics. Once we have nailed that, we > can look at WC, which is a lot more tricky as archs differs more > widely, but one thing at a time. > > > But UC accesses are always done entirely ordered, and honestly, any > > status register that starts a DMA would not make sense any other way. > > > > Of course, you'd have to be pretty odd to want to start a DMA with a > > read anyway - partly exactly because it's bad for performance since > > reads will be synchronous and not buffered like a write). > > I have bad memories of old adaptec controllers ... > > That said, I think the above might not be right on ARM if we want to > make it the rule, Will, what do you reckon ? So there are two cases to consider: 1. if (readl(DEVICE_DMA_STATUS) == DMA_DONE) mydata = *dma_bufp; 2. *dma_bufp = 42; readl(DEVICE_DMA_KICK_ON_READ); For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to add an mb() to make it work. Do both of these work on power? If so, I guess I can make readl even more expensive :/ Feels a bit like the tail wagging the dog, though. Another thing I just realised is that we restrict the barriers we use in readl/writel on arm64 so that they don't necessarily apply to both loads and stores. To be specific: writel is ordered against prior writes to memory, but not reads; readl is ordered against subsequent reads of memory, but not writes (but note that in example (1) above, the control dependency ensures that). If necessary, I could move the barrier in our readl implementation to be before the read, then play the control-dependency + instruction-sync (ISB) trick that you do on power. Will ^ permalink raw reply [flat|nested] 216+ messages in thread
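An illustrative aside on case (2) from the message above: Will notes that arm/arm64 would need an explicit mb() between the store to the DMA buffer and the readl() that kicks the device. The sketch below shows only that shape, written as plain user-space C so it compiles outside the kernel. readl_model(), mb_model(), fake_kick_reg and dma_buf are hypothetical stand-ins, not kernel APIs, and a C11 fence is only an approximation of what mb() does for a real device.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins so the sketch builds in user space. */
static volatile uint32_t fake_kick_reg; /* pretend "kick DMA on read" register */
static uint32_t dma_buf;                /* pretend coherent DMA buffer */

/* Stand-in for readl(): a single volatile load from the register. */
static inline uint32_t readl_model(const volatile uint32_t *reg)
{
    return *reg;
}

/* Stand-in for mb(): a full fence (assumption: only approximates mb()). */
static inline void mb_model(void)
{
    atomic_thread_fence(memory_order_seq_cst);
}

/* Case (2): fill the buffer, then start the DMA by reading a register. */
uint32_t kick_dma_on_read(uint32_t value)
{
    dma_buf = value;  /* data the device is expected to fetch */
    mb_model();       /* order the buffer store before the MMIO read */
    return readl_model(&fake_kick_reg);
}
```

Case (1) needs no extra barrier in this discussion, since readl() is already ordered against the later read of the buffer.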
* Re: RFC on writel and writel_relaxed 2018-03-28 9:07 ` Will Deacon @ 2018-03-28 9:56 ` Benjamin Herrenschmidt 2018-03-28 10:13 ` Aw: " Lino Sanfilippo 2018-03-28 11:30 ` David Laight 0 siblings, 2 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 9:56 UTC (permalink / raw) To: Will Deacon Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, 2018-03-28 at 10:07 +0100, Will Deacon wrote: > > For arm/arm64 we guarantee ordering for (1) but not for (2) -- you'd need to > add an mb() to make it work. > > Do both of these work on power? Yes. There's even another quirk, see further down ;-) > If so, I guess I can make readl even more > expensive :/ Feels a bit like the tail wagging the dog, though. Maybe, but then readl is always horribly slow anyway so you may not necessarily be losing that much. > Another thing I just realised is that we restrict the barriers we use in > readl/writel on arm64 so that they don't necessarily apply to both loads and > stores. To be specific: > > writel is ordered against prior writes to memory, but not reads That could be tricky... You may end up with something that reads before triggering a DMA and ends up with the post-DMA value ... ugh. > readl is ordered against subsequent reads of memory, but not writes (but > note that in example (1) above, the control dependency ensures that). > > If necessary, I could move the barrier in our readl implementation to be > before the read, then play the control-dependency + instruction-sync (ISB) > trick that you do on power. Yeah so that other trick I'm talking about is also used for timing accuracy. For example, let's say I have a device with a reset bit and the spec says the reset bit needs to be set for at least 10us. 
This is wrong: writel(1, RESET_REG); usleep(10); writel(0, RESET_REG); Because of write posting, the first write might arrive at the device right before the second one. The typical "fix" is to turn that into: writel(1, RESET_REG); readl(RESET_REG); /* Flush posted writes */ usleep(10); writel(0, RESET_REG); *However* the issue here, at least on power, is that the CPU can issue that readl but doesn't necessarily wait for it to complete (ie, the data to return), before proceeding to the usleep. Now a usleep contains a bunch of loads and stores and is probably fine, but a udelay which just loops on the timebase may not be. Thus we may still violate the timing requirement. What we did inside readl, with the twi;isync sequence (which basically means, trap on return value with "trap never" as a condition, followed by isync that ensures all exception conditions are resolved), is force the CPU to "consume" the data from the read before moving on. This effectively makes readl fully synchronous (we would probably avoid that if we were to implement a readl_relaxed). Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
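The reset-pulse example above can be made concrete with a small toy model. This is a sketch under stated assumptions, not kernel code: it assumes a worst-case bus that holds posted writes until something forces them out, and every name in it (struct model_bus, model_writel(), model_readl(), pulse_reset(), ...) is hypothetical. It does not model the separate power-specific problem of the CPU running ahead of an outstanding readl.

```c
#include <assert.h>
#include <stdint.h>

#define POST_DEPTH 8

/* Toy bus plus device: writes are posted, the device records pulse width. */
struct model_bus {
    uint32_t posted[POST_DEPTH]; /* writes buffered between CPU and device */
    int      nposted;
    uint64_t now_us;             /* simulated time */
    uint32_t reset_reg;          /* value the device currently sees */
    uint64_t set_at_us;          /* time the device saw reset go high */
    uint64_t held_us;            /* how long the device saw reset high */
};

/* Deliver all buffered writes to the device at the current time. */
static void model_drain(struct model_bus *b)
{
    for (int i = 0; i < b->nposted; i++) {
        uint32_t v = b->posted[i];
        if (v && !b->reset_reg)
            b->set_at_us = b->now_us;              /* rising edge */
        if (!v && b->reset_reg)
            b->held_us = b->now_us - b->set_at_us; /* falling edge */
        b->reset_reg = v;
    }
    b->nposted = 0;
}

/* A posted write returns immediately; the value just sits in the buffer. */
static void model_writel(struct model_bus *b, uint32_t v)
{
    assert(b->nposted < POST_DEPTH);
    b->posted[b->nposted++] = v;
}

/* A read cannot complete until earlier posted writes reach the device. */
static uint32_t model_readl(struct model_bus *b)
{
    model_drain(b);
    return b->reset_reg;
}

static void model_delay(struct model_bus *b, uint64_t us)
{
    b->now_us += us;
}

/* Returns how long (in us) the device saw the reset bit asserted. */
uint64_t pulse_reset(int read_back)
{
    struct model_bus b = {0};

    model_writel(&b, 1);
    if (read_back)
        (void)model_readl(&b); /* flush the posted write */
    model_delay(&b, 10);
    model_writel(&b, 0);
    model_drain(&b);           /* the buffer empties eventually */
    return b.held_us;
}
```

In this model pulse_reset(0) reports a 0us pulse because both writes reach the device back to back, while pulse_reset(1) reports the intended 10us.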
* Aw: Re: RFC on writel and writel_relaxed 2018-03-28 9:56 ` Benjamin Herrenschmidt @ 2018-03-28 10:13 ` Lino Sanfilippo 2018-03-28 10:20 ` Benjamin Herrenschmidt 2018-03-28 11:30 ` David Laight 1 sibling, 1 reply; 216+ messages in thread From: Lino Sanfilippo @ 2018-03-28 10:13 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Will Deacon, Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev Hi, > > Yeah so that other trick I'm talking about is also used for timing > accuracy. > > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive to the device > right before the second one. > Does not write posting only concern PCI? This seems to be a different topic. Furthermore write posting should not include write reordering... Regards, Lino ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: Aw: Re: RFC on writel and writel_relaxed 2018-03-28 10:13 ` Aw: " Lino Sanfilippo @ 2018-03-28 10:20 ` Benjamin Herrenschmidt 0 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 10:20 UTC (permalink / raw) To: Lino Sanfilippo Cc: Will Deacon, Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, 2018-03-28 at 12:13 +0200, Lino Sanfilippo wrote: > Hi, > > > > > > Yeah so that other trick I'm talking about is also used for timing > > accuracy. > > > > For example, let's say I have a device with a reset bit and the spec > > says the reset bit needs to be set for at least 10us. > > > > This is wrong: > > > > writel(1, RESET_REG); > > usleep(10); > > writel(0, RESET_REG); > > > > Because of write posting, the first write might arrive to the device > > right before the second one. > > > > Does not write posting only concern PCI? This seems to be a different topic. Furthermore > write posting should not include write reordering... Nobody's talking about re-ordering and no, write posting is rather common practice on a whole lot of different busses, not just PCI(e). Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* RE: RFC on writel and writel_relaxed 2018-03-28 9:56 ` Benjamin Herrenschmidt @ 2018-03-28 11:30 ` David Laight 2018-03-28 11:30 ` David Laight 1 sibling, 0 replies; 216+ messages in thread From: David Laight @ 2018-03-28 11:30 UTC (permalink / raw) To: 'Benjamin Herrenschmidt', Will Deacon Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev From: Benjamin Herrenschmidt > Sent: 28 March 2018 10:56 ... > For example, let's say I have a device with a reset bit and the spec > says the reset bit needs to be set for at least 10us. > > This is wrong: > > writel(1, RESET_REG); > usleep(10); > writel(0, RESET_REG); > > Because of write posting, the first write might arrive to the device > right before the second one. > > The typical "fix" is to turn that into: > > writel(1, RESET_REG); > readl(RESET_REG); /* Flush posted writes */ Would a writel(1, RESET_REG) here provide enough synchronisation? > usleep(10); > writel(0, RESET_REG); > > *However* the issue here, at least on power, is that the CPU can issue > that readl but doesn't necessarily wait for it to complete (ie, the > data to return), before proceeding to the usleep. Now a usleep contains > a bunch of loads and stores and is probably fine, but a udelay which > just loops on the timebase may not be. > > Thus we may still violate the timing requirement. I've seen that sort of code (with udelay() and no read back) quite often. How many were in linux I don't know. For small delays I usually fix it by repeated writes (of the same value) to the device register. That can guarantee very short intervals. The only time I've actually seen buffered writes break timing was between a 286 and an 8259 interrupt controller. If you wrote to the mask then enabled interrupts the first IACK cycle could be too close to the write and break the cycle recovery time. 
That clobbered most of the interrupt controller registers. That probably affected every 286 board ever built! Not sure how much software added the required extra bus cycle. > What we did inside readl, with the twi;isync sequence (which basically > means, trap on return value with "trap never" as a condition, followed > by isync that ensures all exception conditions are resolved), is force > the CPU to "consume" the data from the read before moving on. > > This effectively makes readl fully synchronous (we would probably avoid > that if we were to implement a readl_relaxed). I've always wondered exactly what the twi;isync were for - always seemed very heavy handed for most mmio reads. Particularly if you are doing mmio reads from a data fifo. Perhaps there should be a writel_wait() that is allowed to do a read back for such code paths? David ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-28 11:30 ` David Laight (?) @ 2018-03-28 15:12 ` Benjamin Herrenschmidt 2018-03-28 16:16 ` David Laight -1 siblings, 1 reply; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 15:12 UTC (permalink / raw) To: David Laight, Will Deacon Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev On Wed, 2018-03-28 at 11:30 +0000, David Laight wrote: > From: Benjamin Herrenschmidt > > Sent: 28 March 2018 10:56 > > ... > > For example, let's say I have a device with a reset bit and the spec > > says the reset bit needs to be set for at least 10us. > > > > This is wrong: > > > > writel(1, RESET_REG); > > usleep(10); > > writel(0, RESET_REG); > > > > Because of write posting, the first write might arrive to the device > > right before the second one. > > > > The typical "fix" is to turn that into: > > > > writel(1, RESET_REG); > > readl(RESET_REG); /* Flush posted writes */ > > Would a writel(1, RESET_REG) here provide enough synchronisation? Probably yes. It's one of those things where you try to deal with the fact that 90% of driver writers barely understand the basic stuff and so you need the "default" accessors to be hardened as much as possible. We still need to get a reasonable definition of the semantics of the relaxed ones vs. WC memory but let's get through that exercise first and hopefully for the last time. > > usleep(10); > > writel(0, RESET_REG); > > > > *However* the issue here, at least on power, is that the CPU can issue > > that readl but doesn't necessarily wait for it to complete (ie, the > > data to return), before proceeding to the usleep. Now a usleep contains > > a bunch of loads and stores and is probably fine, but a udelay which > > just loops on the timebase may not be. > > > > Thus we may still violate the timing requirement. 
> > I've seen that sort of code (with udelay() and no read back) quite often. > How many were in linux I don't know. > > For small delays I usually fix it by repeated writes (of the same value) > to the device register. That can guarantee very short intervals. As long as you know the bus frequency... > The only time I've actually seen buffered writes break timing was > between a 286 and an 8259 interrupt controller. :-) The problem for me is not so much what I've seen, I realize that most of the issues we are talking about are the kind that will hit once in a thousand times or less. But we *can* reason about them in a way that can effectively prevent the problem completely and when your cluster has 10000 machines, 1/1000 starts becoming significant. These days the vast majority of IO devices either are 99% DMA driven so that a bit of overhead on MMIOs is irrelevant, or have one fast path (low latency IB etc...) that needs some finer control, and the rest is all setup which can be paranoid at will. So I think we should aim for the "default" accessors most people use to be as hardened as we can think of. I favor correctness over performance in all cases. But then we also define a reasonable semantic for the relaxed ones (well, we sort-of do have one, we might have to make it a bit more precise in some areas) that allows the few MMIO fast path that care to be optimized. > If you wrote to the mask then enabled interrupts the first IACK cycle > could be too close to write and break the cycle recovery time. > That clobbered most of the interrupt controller registers. > That probably affected every 286 board ever built! > Not sure how much software added the required extra bus cycle. > > > What we did inside readl, with the twi;isync sequence (which basically > > means, trap on return value with "trap never" as a condition, followed > > by isync that ensures all exception conditions are resolved), is force > > the CPU to "consume" the data from the read before moving on. 
> > > > This effectively makes readl fully synchronous (we would probably avoid > > that if we were to implement a readl_relaxed). > > I've always wondered exactly what the twi;isync were for - always seemed > very heavy handed for most mmio reads. > Particularly if you are doing mmio reads from a data fifo. If you do that you should use the "s" version of the accessors. Those will only do the above trick at the end of the access series. Also a FIFO needs special care about endianness anyway, so you should use those accessors regardless. (Hint: you never endian swap a FIFO even on BE on a LE device, unless something's been wired very badly in HW). > Perhaps there should be a writel_wait() that is allowed to do a read back > for such code paths? I think what we have is fine, we just define that the standard writel/readl as fairly simple and hardened, and we look at providing a somewhat reasonable set of relaxed variants for optimizing fast path. We pretty much already are there, we just need to be better at defining the semantics. And for the super high perf case, which thankfully is either seldom (server high perf network stuff) or very arch specific (ARM SoC stuff), then arch specific driver hacks will always remain the norm. Cheers, Ben. > David > ^ permalink raw reply [flat|nested] 216+ messages in thread
* RE: RFC on writel and writel_relaxed 2018-03-28 15:12 ` Benjamin Herrenschmidt @ 2018-03-28 16:16 ` David Laight 0 siblings, 0 replies; 216+ messages in thread From: David Laight @ 2018-03-28 16:16 UTC (permalink / raw) To: 'Benjamin Herrenschmidt', Will Deacon Cc: Linus Torvalds, Alexander Duyck, Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Paul E. McKenney, netdev From: Benjamin Herrenschmidt > Sent: 28 March 2018 16:13 ... > > I've always wondered exactly what the twi;isync were for - always seemed > > very heavy handed for most mmio reads. > > Particularly if you are doing mmio reads from a data fifo. > > If you do that you should use the "s" version of the accessors. Those > will only do the above trick at the end of the access series. Also a > FIFO needs special care about endianness anyway, so you should use > those accessors regardless. (Hint: you never endian swap a FIFO even on > BE on a LE device, unless something's been wired very badly in HW). That was actually a 64 bit wide fifo connected to a 16bit wide PIO interface. Reading the high address 'clocked' the fifo. So the first 3 reads could happen in any order, but the 4th had to be last. This is a small ppc and we shovel a lot of data through that fifo. Whether it needed byteswapping depended completely on how our hardware people had built the pcb (not made easy by some docs using the ibm bit numbering). In fact it didn't.... While that driver only had to run on a very specific small ppc, generic drivers might have similar issues. I suspect that writel() is always (or should always be): barrier_before_writel() writel_relaxed() barrier_after_writel() So if a driver needs to do multiple writes (without strong ordering) it should be able to repeat the writel_relaxed() with only one set of barriers. Similarly for readl(). 
In addition a lesser barrier is probably enough between a readl_relaxed() and a writel_relaxed() that is conditional on the read value. David ^ permalink raw reply [flat|nested] 216+ messages in thread
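The decomposition David describes can be sketched as a small userspace model. This is not kernel code and makes no claim about how any architecture implements writel(): barrier_before_writel() and barrier_after_writel() are the hypothetical names from his message, C11 fences stand in for whatever barrier an architecture would really need, model_writel_relaxed() is just a volatile store, and barrier_count exists only so the saving can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Counts barriers issued, purely so the cost can be compared. */
static unsigned int barrier_count;

/* Hypothetical names taken from the message; the C11 fences are
 * stand-ins, not the barriers a real architecture would use. */
static void barrier_before_writel(void)
{
	atomic_thread_fence(memory_order_release);
	barrier_count++;
}

static void barrier_after_writel(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	barrier_count++;
}

/* Model of a relaxed MMIO store: a volatile write and nothing else. */
static void model_writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
	assert(addr != NULL);
	*addr = val;
}

/* Model of writel(): the relaxed store bracketed by both barriers. */
static void model_writel(uint32_t val, volatile uint32_t *addr)
{
	barrier_before_writel();
	model_writel_relaxed(val, addr);
	barrier_after_writel();
}

/* Several relaxed stores sharing a single pair of barriers. */
static void model_writel_batch(const uint32_t *vals,
			       volatile uint32_t *regs, size_t n)
{
	assert(vals != NULL && regs != NULL);
	barrier_before_writel();
	for (size_t i = 0; i < n; i++)
		model_writel_relaxed(vals[i], &regs[i]);
	barrier_after_writel();
}
```

In this model, four calls to model_writel() issue eight barriers, while one call to model_writel_batch() over the same four registers issues two. Whether the stores inside the batch may be reordered against each other is exactly the semantic question the thread is trying to pin down, so the model says nothing about that.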
* Re: RFC on writel and writel_relaxed 2018-03-27 15:10 ` Will Deacon 2018-03-27 18:54 ` Alexander Duyck @ 2018-03-28 1:21 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-28 1:21 UTC (permalink / raw) To: Will Deacon, Sinan Kaya Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Paul E. McKenney, netdev, Alexander Duyck, torvalds On Tue, 2018-03-27 at 16:10 +0100, Will Deacon wrote: > To clarify: are you saying that on x86 you need a wmb() prior to a writel > if you want that writel to be ordered after prior writes to memory? Is this > specific to WC memory or some other non-standard attribute? > > The only reason we have wmb() inside writel() on arm, arm64 and power is for > parity with x86 because Linus (CC'd) wanted architectures to order I/O vs > memory by default so that it was easier to write portable drivers. The > performance impact of that implicit barrier is non-trivial, but we want the > driver portability and I went as far as adding generic _relaxed versions for > the cases where ordering isn't required. You seem to be suggesting that none > of this is necessary and drivers would already run into problems on x86 if > they didn't use wmb() explicitly in conjunction with writel, which I find > hard to believe and is in direct contradiction with the current Linux I/O > memory model (modulo the broken example in the dma_*mb section of > memory-barriers.txt). Another clarification while we are at it .... All of this only applies to concurrent access by the CPU and the device to memory allocated with dma_alloc_coherent(). For memory "mapped" into the DMA domain via dma_map_* then an extra dma_sync_for_* is needed. In most useful server cases etc... 
these latter are NOPs, but on architectures without full DMA cache coherency, or when using swiotlb, dma_map_* might maintain bounce buffers or play additional cache flushing tricks. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
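Why the extra sync matters for mapped (as opposed to coherent) memory can be illustrated with a toy userspace model of a bounce buffer. Everything here is hypothetical: struct model_mapping and the model_* functions are invented for illustration and are not the kernel DMA API, and the memcpy() calls only mimic what a swiotlb-style implementation might do behind dma_map_* and the dma_sync_for_* step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_LEN 8

/* Hypothetical model of a streaming mapping that goes through a bounce
 * buffer: the "device" only ever sees `bounce`, never `cpu_buf`. */
struct model_mapping {
	uint8_t *cpu_buf;        /* what the driver reads and writes */
	uint8_t bounce[BUF_LEN]; /* what the device would fetch */
};

/* Stand-in for the map step: snapshot the CPU buffer into the bounce buffer. */
static void model_dma_map(struct model_mapping *m, uint8_t *cpu_buf)
{
	assert(m != NULL && cpu_buf != NULL);
	m->cpu_buf = cpu_buf;
	memcpy(m->bounce, cpu_buf, BUF_LEN);
}

/* Stand-in for the sync-for-device step: push later CPU writes out again. */
static void model_dma_sync_for_device(struct model_mapping *m)
{
	assert(m != NULL && m->cpu_buf != NULL);
	memcpy(m->bounce, m->cpu_buf, BUF_LEN);
}

/* What the device would read at offset i. */
static uint8_t model_device_read(const struct model_mapping *m, size_t i)
{
	assert(m != NULL && i < BUF_LEN);
	return m->bounce[i];
}
```

In this model a CPU store made after model_dma_map() stays invisible to model_device_read() until model_dma_sync_for_device() runs, and no memory barrier of any strength would change that. On a fully coherent system without bounce buffering the sync would have nothing to copy, which matches the "NOPs" remark above.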
* Re: RFC on writel and writel_relaxed 2018-03-27 14:46 ` Sinan Kaya 2018-03-27 15:01 ` Jose Abreu 2018-03-27 15:10 ` Will Deacon @ 2018-03-27 21:35 ` Benjamin Herrenschmidt 2 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 21:35 UTC (permalink / raw) To: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe Cc: David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Alexander Duyck, Will Deacon, Paul E. McKenney, netdev, Alexander Duyck On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: > combined buffers. > > Alex: > "Don't bother. I can tell you right now that for x86 you have to have a > wmb() before the writel(). No, this isn't the semantics of writel. You shouldn't need it unless something changed and we need to revisit our complete understanding of *all* MMIO accessor semantics. At least for UC space, it has always been accepted (and enforced) that writel would not require any other barrier to order vs. previous stores to memory. > Based on the comment in > (https://www.spinics.net/lists/linux-rdma/msg62666.html): > Replacing wmb() + writel() with wmb() + writel_relaxed() will work on > PPC, it will just not give you a benefit today. > > I say the patch set stays. This gives benefit on ARM, and has no > effect on x86 and PowerPC. If you want to look at trying to optimize > things further on PowerPC and such then go for it in terms of trying > to implement the writel_relaxed(). Otherwise I say we call the ARM > goodness a win and don't get ourselves too wrapped up in trying to fix > this for all architectures." ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-26 16:54 ` Jason Gunthorpe @ 2018-03-26 21:26 ` Benjamin Herrenschmidt -1 siblings, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-26 21:26 UTC (permalink / raw) To: Jason Gunthorpe, David Laight Cc: Sinan Kaya, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma On Mon, 2018-03-26 at 10:54 -0600, Jason Gunthorpe wrote: > On Mon, Mar 26, 2018 at 11:08:45AM +0000, David Laight wrote: > > > > This is a super performance critical operation for most drivers and > > > > directly impacts network performance. > > > > Perhaps there ought to be writel_nobarrier() (etc) that never contain > > any barriers at all. > > This might mean that they are always just the memory operation, > > but it would make it more obvious what the driver was doing. > > I think that is what writel_relaxed is supposed to be. > > The only restriction it has is that the writes to a single device > using UC memory must be kept in program order.. Which requires barriers on some architectures :-) Also we don't have a clear definition of what happens on WC memory. Cheers, Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed @ 2018-03-27 21:54 Alexander Duyck 2018-03-27 22:35 ` Sinan Kaya 2018-03-27 23:43 ` Benjamin Herrenschmidt 0 siblings, 2 replies; 216+ messages in thread From: Alexander Duyck @ 2018-03-27 21:54 UTC (permalink / raw) To: Benjamin Herrenschmidt Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Will Deacon, Paul E. McKenney, netdev On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote: > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: >> combined buffers. >> >> Alex: >> "Don't bother. I can tell you right now that for x86 you have to have a >> wmb() before the writel(). > > No, this isn't the semantics of writel. You shouldn't need it unless > something changed and we need to revisit our complete understanding of > *all* MMIO accessor semantics. The issue seems to be that there have been two different ways of dealing with this. There has historically been a number of different drivers that have been carrying this wmb() workaround since something like 2002. I get that the semantics for writel might have changed since then, but those of us who already have the wmb() in our drivers will be very wary of anyone wanting to go through and remove them since writel is supposed to be "good enough". I would much rather err on the side of caution here. I view the wmb() + writel_relaxed() as more of a driver owning and handling this itself. Besides in the Intel Ethernet driver case it is better performance as our wmb() placement for us also provides a secondary barrier so we don't need to add a separate smp_wmb() to deal with a potential race we have with the Tx cleanup. > At least for UC space, it has always been accepted (and enforced) that > writel would not require any other barrier to order vs. previous stores > to memory. 
So the one thing I would question here is if this is UC vs UC or if this extends to other types as well? So for x86 we could find references to Write Combining being flushed by a write to UC memory, however I have yet to find a clear explanation of what a write to UC does to WB. My personal inclination would be to err on the side of caution. I just don't want us going through and removing the wmb() calls because it "should" work. I would want to know for certain it will work. - Alex ^ permalink raw reply [flat|nested] 216+ messages in thread
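The "driver owns the barrier" pattern discussed here, wmb() followed by writel_relaxed(), can be sketched as a userspace model of a transmit ring. This is not taken from any Intel driver: struct model_ring, model_wmb(), model_writel_relaxed() and model_xmit() are hypothetical names, the C11 release fence is only a stand-in for wmb(), and tail_reg is an ordinary volatile variable standing in for an MMIO doorbell.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4

/* Hypothetical descriptor layout, for illustration only. */
struct model_desc {
	uint64_t addr;
	uint32_t len;
};

struct model_ring {
	struct model_desc desc[RING_SIZE]; /* descriptors in ordinary memory */
	unsigned int next_to_use;          /* index a cleanup path would also read */
	volatile uint32_t tail_reg;        /* stands in for the MMIO doorbell */
};

/* Stand-in for wmb(); a real driver would use the kernel primitive. */
static void model_wmb(void)
{
	atomic_thread_fence(memory_order_release);
}

/* Stand-in for writel_relaxed(): a volatile store with no barrier. */
static void model_writel_relaxed(uint32_t val, volatile uint32_t *addr)
{
	*addr = val;
}

/* Queue one descriptor, then ring the doorbell. */
static void model_xmit(struct model_ring *r, uint64_t addr, uint32_t len)
{
	unsigned int i;

	assert(r != NULL);
	i = r->next_to_use;

	r->desc[i].addr = addr;
	r->desc[i].len = len;
	r->next_to_use = (i + 1) % RING_SIZE;

	/* One explicit barrier: the descriptor and index stores above are
	 * ordered before the doorbell store below. */
	model_wmb();
	model_writel_relaxed(r->next_to_use, &r->tail_reg);
}
```

The point of the sketch is the placement: a single explicit barrier sits between the memory stores and the doorbell, so the doorbell store itself does not need to carry a second one. Whether a plain writel() would already give that ordering on x86 for WB memory is the open question in this message.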
* Re: RFC on writel and writel_relaxed 2018-03-27 21:54 Alexander Duyck @ 2018-03-27 22:35 ` Sinan Kaya 2018-03-27 23:43 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 216+ messages in thread From: Sinan Kaya @ 2018-03-27 22:35 UTC (permalink / raw) To: Alexander Duyck, Benjamin Herrenschmidt Cc: Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Will Deacon, Paul E. McKenney, netdev On 3/27/2018 5:54 PM, Alexander Duyck wrote: > I view the wmb() + writel_relaxed() as more of a driver owning and > handling this itself. Besides in the Intel Ethernet driver case it is > better performance as our wmb() placement for us also provides a > secondary barrier so we don't need to add a separate smp_wmb() to deal > with a potential race we have with the Tx cleanup. Thanks for the reminder. I forgot about the double barrier optimization. wmb() + writel_relaxed() seems to be the best option for Intel network drivers at this moment. Otherwise, we'll have to remove wmb() and throw in smp barriers there like you mentioned. I'll leave the changes in the Intel drivers alone. -- Sinan Kaya Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm Technologies, Inc. Qualcomm Technologies, Inc. is a member of the Code Aurora Forum, a Linux Foundation Collaborative Project. ^ permalink raw reply [flat|nested] 216+ messages in thread
* Re: RFC on writel and writel_relaxed 2018-03-27 21:54 Alexander Duyck 2018-03-27 22:35 ` Sinan Kaya @ 2018-03-27 23:43 ` Benjamin Herrenschmidt 1 sibling, 0 replies; 216+ messages in thread From: Benjamin Herrenschmidt @ 2018-03-27 23:43 UTC (permalink / raw) To: Alexander Duyck Cc: Sinan Kaya, Arnd Bergmann, Jason Gunthorpe, David Laight, Oliver, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT), linux-rdma, Will Deacon, Paul E. McKenney, netdev On Tue, 2018-03-27 at 14:54 -0700, Alexander Duyck wrote: > On Tue, Mar 27, 2018 at 2:35 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > On Tue, 2018-03-27 at 10:46 -0400, Sinan Kaya wrote: > > > combined buffers. > > > > > > Alex: > > > "Don't bother. I can tell you right now that for x86 you have to have a > > > wmb() before the writel(). > > > > No, this isn't the semantics of writel. You shouldn't need it unless > > something changed and we need to revisit our complete understanding of > > *all* MMIO accessor semantics. > > The issue seems to be that there have been two different ways of > dealing with this. There has historically been a number of different > drivers that have been carrying this wmb() workaround since something > like 2002. I get that the semantics for writel might have changed > since then, but those of us who already have the wmb() in our drivers > will be very wary of anyone wanting to go through and remove them > since writel is supposed to be "good enough". I would much rather err > on the side of caution here. > > I view the wmb() + writel_relaxed() as more of a driver owning and > handling this itself. Besides in the Intel Ethernet driver case it is > better performance as our wmb() placement for us also provides a > secondary barrier so we don't need to add a separate smp_wmb() to deal > with a potential race we have with the Tx cleanup. > > > At least for UC space, it has always been accepted (and enforced) that > > writel would not require any other barrier to order vs. 
previous stores > > to memory. > > So the one thing I would question here is if this is UC vs UC or if > this extends to other types as well? So for x86 we could find > references to Write Combining being flushed by a write to UC memory, > however I have yet to find a clear explanation of what a write to UC > does to WB. Well, this is the standard write memory + trigger DMA case, the one specific case for which Linus was adamant we don't need another barrier back then ... > My personal inclination would be to err on the side of > caution. Which means writel_relaxed is now pointless? We need clear semantics here. In this case the "side of caution" means we are randomly doing things not understanding what really happens and that makes me *more* nervous. > I just don't want us going through and removing the wmb() > calls because it "should" work. I would want to know for certain it > will work. We need to know for certain anyway. Otherwise, all the drivers that do not have wmb's are potentially broken. So I don't agree with the status quo. We need to establish precisely what x86 does, decide what we want the semantics of writel to be, and implement things accordingly. Ben. ^ permalink raw reply [flat|nested] 216+ messages in thread