All of lore.kernel.org
 help / color / mirror / Atom feed
* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
@ 2018-06-29 12:28 Lucas Stach
  2018-06-29 14:25 ` Russell King - ARM Linux
  0 siblings, 1 reply; 17+ messages in thread
From: Lucas Stach @ 2018-06-29 12:28 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

Oleksij was hunting a memory corruption issue on SocFPGA for the last
few days. While we still don't have a clear picture about what's going
wrong, I was reading some driver code and stumbled across the use of
dma_wmb, which left me with some questions about the intended use
and/or implementation. I'll try to describe by train of thought below.

Some drivers like the DWMAC driver
(drivers/net/ethernet/stmicro/stmmac/) are using dma_wmb to order
accesses to DMA descriptors, to make sure the own bit is only visible
to the DMA engine once the descriptor has been fully set up. This seems
to be correct and the intended use of this barrier as described in
Documentation/memory-barriers.txt.

The barrier documentation describes the dma barrier as "for use with
consistent memory", wich Documentation/DMA-API.txt says is memory returned by dma_alloc_coherent. As the DMA descriptors are located in memory allocated this way, it seems the driver side implementation is fine (or at least matches the documentation).

The SocFPGA platform, where this driver is used, is a Cortex A9 with a
PL310 outer cache. The peripherals being non coherent masters, as on
most ARMv7 systems.

Now the current implementation of dma_wmb maps to a dmb(oshst), which
to me seems like it is too weak to guarantee store visibility ordering
to the non-coherent DMA engine master. I would have expected that at
least a dmb(st) is necessary to provide the ordering, as specified in
the?dma_wmb Documentation on such a system.

Also with a outer cache present I'm unsure if a dmb(st) has the desired
effect. Do we need to have a full cache synchronization operation in
between the writes, as is done for the raw wmb()?

Can you confirm my understanding of the situation, or did I miss
something there?

Thanks,
Lucas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 12:28 Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches Lucas Stach
@ 2018-06-29 14:25 ` Russell King - ARM Linux
  2018-06-29 16:19   ` Lucas Stach
  2018-06-29 16:22   ` Will Deacon
  0 siblings, 2 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2018-06-29 14:25 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 29, 2018 at 02:28:10PM +0200, Lucas Stach wrote:
> Oleksij was hunting a memory corruption issue on SocFPGA for the last
> few days. While we still don't have a clear picture about what's going
> wrong, I was reading some driver code and stumbled across the use of
> dma_wmb, which left me with some questions about the intended use
> and/or implementation. I'll try to describe by train of thought below.

I don't have a full understanding of this (partly because of
implementation details, and I don't know the SocFPGA implementation.)

The dma_*() barriers were proposed by Alexander Duyck in 2014, and the
commit explicitly states that dma_wmb() will be outer-shareable only.

I think what constitutes "outer-shareable" in terms of DMB is also
something of a debate - the ARM ARM is vague on that subject, basically
saying its implementation dependent:

  In a VMSA implementation, for Shareable Normal memory, whether
  there is a distinction between Inner Shareable and Outer Shareable
  is IMPLEMENTATION DEFINED.

In the case of the three levels of shareability being implemented, a
"dmb ish" only reaches as far as the inner domain, "dmb osh" reaches
both the inner and outer domains, and "dmb sy" reaches outside the
outer domain.  Where there's no distinction between inner and outer,
it seems that both "dmb ish*" and "dmb osh*" will both have the same
reach.

That would mean that "dmb osh*" would not be visible outside of the
shareability domain, which surely is where most non-coherent DMA
masters in our SoCs lie, so on the face of it, it seems that it is
the wrong reach for this barrier.

However, it obviously works for most systems out there, which suggests
that "dmb osh*" reaches beyond "dmb ish*", but that seems to go against
the ARM ARM.

All our memory in Linux is mapped with NOS=1, which basically means
all shareable memory in the system is only inner shareable, and
nothing is shareable outside of the inner domain.  Does it make
sense to use "dmb osh*" if we have no outer-shareable memory?

Final bit of the puzzle is about which domain the PL310 ends up in.
It will come as no surprise that's also "implementation defined":

  "The relationship between these conceptual levels of cache and the
   implemented physical levels of cache is IMPLEMENTATION DEFINED,
   and can differ from the boundaries between the Inner and Outer
   Shareability domains."

were "conceptual levels" is talking about the inner/outer cacheability
level.

The only thing we can be sure about is that the L1 caches are in the
inner cacheability and inner shareability domains.  Everything else
is... implementation defined.

So, it seems to be possible that a L2 cache, such as the PL310, could
be in the inner cacheability domain and the outer shareability domain,
or vice versa!

"Implementation defined" is a nightmare when it comes to generic OSes
that have to work across multiple different implementations.

Maybe Will can shed some light on this topic.

> Some drivers like the DWMAC driver
> (drivers/net/ethernet/stmicro/stmmac/) are using dma_wmb to order
> accesses to DMA descriptors, to make sure the own bit is only visible
> to the DMA engine once the descriptor has been fully set up. This seems
> to be correct and the intended use of this barrier as described in
> Documentation/memory-barriers.txt.
> 
> The barrier documentation describes the dma barrier as "for use with
> consistent memory", wich Documentation/DMA-API.txt says is memory returned by dma_alloc_coherent. As the DMA descriptors are located in memory allocated this way, it seems the driver side implementation is fine (or at least matches the documentation).
> 
> The SocFPGA platform, where this driver is used, is a Cortex A9 with a
> PL310 outer cache. The peripherals being non coherent masters, as on
> most ARMv7 systems.
> 
> Now the current implementation of dma_wmb maps to a dmb(oshst), which
> to me seems like it is too weak to guarantee store visibility ordering
> to the non-coherent DMA engine master. I would have expected that at
> least a dmb(st) is necessary to provide the ordering, as specified in
> the?dma_wmb Documentation on such a system.
> 
> Also with a outer cache present I'm unsure if a dmb(st) has the desired
> effect. Do we need to have a full cache synchronization operation in
> between the writes, as is done for the raw wmb()?
> 
> Can you confirm my understanding of the situation, or did I miss
> something there?
> 
> Thanks,
> Lucas
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 14:25 ` Russell King - ARM Linux
@ 2018-06-29 16:19   ` Lucas Stach
  2018-06-29 16:22   ` Will Deacon
  1 sibling, 0 replies; 17+ messages in thread
From: Lucas Stach @ 2018-06-29 16:19 UTC (permalink / raw)
  To: linux-arm-kernel

Am Freitag, den 29.06.2018, 15:25 +0100 schrieb Russell King - ARM Linux:
> On Fri, Jun 29, 2018 at 02:28:10PM +0200, Lucas Stach wrote:
> > Oleksij was hunting a memory corruption issue on SocFPGA for the last
> > few days. While we still don't have a clear picture about what's going
> > wrong, I was reading some driver code and stumbled across the use of
> > dma_wmb, which left me with some questions about the intended use
> > and/or implementation. I'll try to describe by train of thought below.
> 
> I don't have a full understanding of this (partly because of
> implementation details, and I don't know the SocFPGA implementation.)

Thanks, at least I'm not the only one confused about those layers of 
IMPLEMENTATION DEFINED stuff.

> The dma_*() barriers were proposed by Alexander Duyck in 2014, and the
> commit explicitly states that dma_wmb() will be outer-shareable only.

Right, I think part of my confusion stems from the fact that this
commit introduces the dma_wmb() barriers as outer-sharable only, but at
the same time adds documentation, which states that the intended use of
those barriers is to guarantee write visibility?ordering to DMA
masters.

To my understanding, this is only true if the DMA master is at least in
the outer shareability domain. This probably isn't the case on most
generic ARMv7 systems where - wherever the implementation defined
boundary between inner and outer shareable domain might be - most
masters will be in the system domain.

So to me it seems that the use of dma_wmb() on a system with DMA
masters in the system shareability domain is unsafe, even if the
documentation states otherwise.

> I think what constitutes "outer-shareable" in terms of DMB is also
> something of a debate - the ARM ARM is vague on that subject, basically
> saying its implementation dependent:
> 
> ? In a VMSA implementation, for Shareable Normal memory, whether
> ? there is a distinction between Inner Shareable and Outer Shareable
> ? is IMPLEMENTATION DEFINED.
> 
> In the case of the three levels of shareability being implemented, a
> "dmb ish" only reaches as far as the inner domain, "dmb osh" reaches
> both the inner and outer domains, and "dmb sy" reaches outside the
> outer domain.??Where there's no distinction between inner and outer,
> it seems that both "dmb ish*" and "dmb osh*" will both have the same
> reach.
> 
> That would mean that "dmb osh*" would not be visible outside of the
> shareability domain, which surely is where most non-coherent DMA
> masters in our SoCs lie, so on the face of it, it seems that it is
> the wrong reach for this barrier.
> 
> However, it obviously works for most systems out there, which suggests
> that "dmb osh*" reaches beyond "dmb ish*", but that seems to go against
> the ARM ARM.
> 
> All our memory in Linux is mapped with NOS=1, which basically means
> all shareable memory in the system is only inner shareable, and
> nothing is shareable outside of the inner domain.??Does it make
> sense to use "dmb osh*" if we have no outer-shareable memory?
> 
> Final bit of the puzzle is about which domain the PL310 ends up in.
> It will come as no surprise that's also "implementation defined":
> 
> ? "The relationship between these conceptual levels of cache and the
> ???implemented physical levels of cache is IMPLEMENTATION DEFINED,
> ???and can differ from the boundaries between the Inner and Outer
> ???Shareability domains."
> 
> were "conceptual levels" is talking about the inner/outer cacheability
> level.
> 
> The only thing we can be sure about is that the L1 caches are in the
> inner cacheability and inner shareability domains.??Everything else
> is... implementation defined.
> 
> So, it seems to be possible that a L2 cache, such as the PL310, could
> be in the inner cacheability domain and the outer shareability domain,
> or vice versa!
> 
> "Implementation defined" is a nightmare when it comes to generic OSes
> that have to work across multiple different implementations.
> 
> Maybe Will can shed some light on this topic.
> 
> > Some drivers like the DWMAC driver
> > (drivers/net/ethernet/stmicro/stmmac/) are using dma_wmb to order
> > accesses to DMA descriptors, to make sure the own bit is only visible
> > to the DMA engine once the descriptor has been fully set up. This seems
> > to be correct and the intended use of this barrier as described in
> > Documentation/memory-barriers.txt.
> > 
> > The barrier documentation describes the dma barrier as "for use with
> > consistent memory", wich Documentation/DMA-API.txt says is memory returned by dma_alloc_coherent. As the DMA descriptors are located in memory allocated this way, it seems the driver side implementation is fine (or at least matches the documentation).
> > 
> > The SocFPGA platform, where this driver is used, is a Cortex A9 with a
> > PL310 outer cache. The peripherals being non coherent masters, as on
> > most ARMv7 systems.
> > 
> > Now the current implementation of dma_wmb maps to a dmb(oshst), which
> > to me seems like it is too weak to guarantee store visibility ordering
> > to the non-coherent DMA engine master. I would have expected that at
> > least a dmb(st) is necessary to provide the ordering, as specified in
> > the?dma_wmb Documentation on such a system.
> > 
> > Also with a outer cache present I'm unsure if a dmb(st) has the desired
> > effect. Do we need to have a full cache synchronization operation in
> > between the writes, as is done for the raw wmb()?
> > 
> > Can you confirm my understanding of the situation, or did I miss
> > something there?
> > 
> > Thanks,
> > Lucas
> > 
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 14:25 ` Russell King - ARM Linux
  2018-06-29 16:19   ` Lucas Stach
@ 2018-06-29 16:22   ` Will Deacon
  2018-06-29 16:48     ` Russell King - ARM Linux
  2018-06-29 17:14     ` Lucas Stach
  1 sibling, 2 replies; 17+ messages in thread
From: Will Deacon @ 2018-06-29 16:22 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 29, 2018 at 02:28:10PM +0200, Lucas Stach wrote:
> > Oleksij was hunting a memory corruption issue on SocFPGA for the last
> > few days. While we still don't have a clear picture about what's going
> > wrong, I was reading some driver code and stumbled across the use of
> > dma_wmb, which left me with some questions about the intended use
> > and/or implementation. I'll try to describe by train of thought below.
> 
> I don't have a full understanding of this (partly because of
> implementation details, and I don't know the SocFPGA implementation.)
> 
> The dma_*() barriers were proposed by Alexander Duyck in 2014, and the
> commit explicitly states that dma_wmb() will be outer-shareable only.
> 
> I think what constitutes "outer-shareable" in terms of DMB is also
> something of a debate - the ARM ARM is vague on that subject, basically
> saying its implementation dependent:
> 
>   In a VMSA implementation, for Shareable Normal memory, whether
>   there is a distinction between Inner Shareable and Outer Shareable
>   is IMPLEMENTATION DEFINED.
> 
> In the case of the three levels of shareability being implemented, a
> "dmb ish" only reaches as far as the inner domain, "dmb osh" reaches
> both the inner and outer domains, and "dmb sy" reaches outside the
> outer domain.  Where there's no distinction between inner and outer,
> it seems that both "dmb ish*" and "dmb osh*" will both have the same
> reach.
> 
> That would mean that "dmb osh*" would not be visible outside of the
> shareability domain, which surely is where most non-coherent DMA
> masters in our SoCs lie, so on the face of it, it seems that it is
> the wrong reach for this barrier.
> 
> However, it obviously works for most systems out there, which suggests
> that "dmb osh*" reaches beyond "dmb ish*", but that seems to go against
> the ARM ARM.
> 
> All our memory in Linux is mapped with NOS=1, which basically means
> all shareable memory in the system is only inner shareable, and
> nothing is shareable outside of the inner domain.  Does it make
> sense to use "dmb osh*" if we have no outer-shareable memory?
> 
> Final bit of the puzzle is about which domain the PL310 ends up in.
> It will come as no surprise that's also "implementation defined":
> 
>   "The relationship between these conceptual levels of cache and the
>    implemented physical levels of cache is IMPLEMENTATION DEFINED,
>    and can differ from the boundaries between the Inner and Outer
>    Shareability domains."
> 
> were "conceptual levels" is talking about the inner/outer cacheability
> level.
> 
> The only thing we can be sure about is that the L1 caches are in the
> inner cacheability and inner shareability domains.  Everything else
> is... implementation defined.
> 
> So, it seems to be possible that a L2 cache, such as the PL310, could
> be in the inner cacheability domain and the outer shareability domain,
> or vice versa!
> 
> "Implementation defined" is a nightmare when it comes to generic OSes
> that have to work across multiple different implementations.
> 
> Maybe Will can shed some light on this topic.

You're right that cacheability and shareability are different things. For
the purposes of ordering and coherence, we care about shareability. Normal
non-cacheable is outer-shareable (which is a superset of inner-shareable),
so DMB OSH is sufficient to order accesses to that buffer from the
perspective of all observers. Since we use Normal non-cacheable mappings
for non-coherent DMA buffers (and there is no such thing as a
system-shareable memory type), then these barriers are sufficient to provide
ordering in this case too.

Ideally, we'd limit the DMA barriers to the inner-shareable domain when
dealing with coherent devices, but there's no way to determine that from
the barrier macros.

Lucas -- did changing the shareability of the DMA barriers actually solve
your problem?

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 16:22   ` Will Deacon
@ 2018-06-29 16:48     ` Russell King - ARM Linux
  2018-06-29 17:43       ` Will Deacon
  2018-06-29 17:14     ` Lucas Stach
  1 sibling, 1 reply; 17+ messages in thread
From: Russell King - ARM Linux @ 2018-06-29 16:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 29, 2018 at 05:22:48PM +0100, Will Deacon wrote:
> Hi all,
> 
> On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Jun 29, 2018 at 02:28:10PM +0200, Lucas Stach wrote:
> > > Oleksij was hunting a memory corruption issue on SocFPGA for the last
> > > few days. While we still don't have a clear picture about what's going
> > > wrong, I was reading some driver code and stumbled across the use of
> > > dma_wmb, which left me with some questions about the intended use
> > > and/or implementation. I'll try to describe by train of thought below.
> > 
> > I don't have a full understanding of this (partly because of
> > implementation details, and I don't know the SocFPGA implementation.)
> > 
> > The dma_*() barriers were proposed by Alexander Duyck in 2014, and the
> > commit explicitly states that dma_wmb() will be outer-shareable only.
> > 
> > I think what constitutes "outer-shareable" in terms of DMB is also
> > something of a debate - the ARM ARM is vague on that subject, basically
> > saying its implementation dependent:
> > 
> >   In a VMSA implementation, for Shareable Normal memory, whether
> >   there is a distinction between Inner Shareable and Outer Shareable
> >   is IMPLEMENTATION DEFINED.
> > 
> > In the case of the three levels of shareability being implemented, a
> > "dmb ish" only reaches as far as the inner domain, "dmb osh" reaches
> > both the inner and outer domains, and "dmb sy" reaches outside the
> > outer domain.  Where there's no distinction between inner and outer,
> > it seems that both "dmb ish*" and "dmb osh*" will both have the same
> > reach.
> > 
> > That would mean that "dmb osh*" would not be visible outside of the
> > shareability domain, which surely is where most non-coherent DMA
> > masters in our SoCs lie, so on the face of it, it seems that it is
> > the wrong reach for this barrier.
> > 
> > However, it obviously works for most systems out there, which suggests
> > that "dmb osh*" reaches beyond "dmb ish*", but that seems to go against
> > the ARM ARM.
> > 
> > All our memory in Linux is mapped with NOS=1, which basically means
> > all shareable memory in the system is only inner shareable, and
> > nothing is shareable outside of the inner domain.  Does it make
> > sense to use "dmb osh*" if we have no outer-shareable memory?
> > 
> > Final bit of the puzzle is about which domain the PL310 ends up in.
> > It will come as no surprise that's also "implementation defined":
> > 
> >   "The relationship between these conceptual levels of cache and the
> >    implemented physical levels of cache is IMPLEMENTATION DEFINED,
> >    and can differ from the boundaries between the Inner and Outer
> >    Shareability domains."
> > 
> > were "conceptual levels" is talking about the inner/outer cacheability
> > level.
> > 
> > The only thing we can be sure about is that the L1 caches are in the
> > inner cacheability and inner shareability domains.  Everything else
> > is... implementation defined.
> > 
> > So, it seems to be possible that a L2 cache, such as the PL310, could
> > be in the inner cacheability domain and the outer shareability domain,
> > or vice versa!
> > 
> > "Implementation defined" is a nightmare when it comes to generic OSes
> > that have to work across multiple different implementations.
> > 
> > Maybe Will can shed some light on this topic.
> 
> You're right that cacheability and shareability are different things. For
> the purposes of ordering and coherence, we care about shareability. Normal
> non-cacheable is outer-shareable (which is a superset of inner-shareable),

What does it mean when the "implementation" doesn't define two shareable
domains - does it mean that inner and outer shareable are combined into
just one "shareable" domain, or is outer shareable always treated as
"everything but inner shareable"?

>From the paragraph I quoted from the ARM ARM, it seems that the former
applies, which should also mean that "dmb ish*" and "dmb osh*" are
functionally equivalent, and only touch the inner shareable domain.

You also seem to be saying that PRRR.NOSn is ignored for any mapping
that indicates non-cacheable normal memory - I've not found that stated
in the ARM ARM.  It does say that NOSn doesn't apply if the region is
mapped as strongly-ordered, but that isn't the case for our DMA
coherent mappings.

> so DMB OSH is sufficient to order accesses to that buffer from the
> perspective of all observers. Since we use Normal non-cacheable mappings
> for non-coherent DMA buffers (and there is no such thing as a
> system-shareable memory type), then these barriers are sufficient to provide
> ordering in this case too.
> 
> Ideally, we'd limit the DMA barriers to the inner-shareable domain when
> dealing with coherent devices, but there's no way to determine that from
> the barrier macros.
> 
> Lucas -- did changing the shareability of the DMA barriers actually solve
> your problem?
> 
> Will

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 16:22   ` Will Deacon
  2018-06-29 16:48     ` Russell King - ARM Linux
@ 2018-06-29 17:14     ` Lucas Stach
  2018-06-29 17:46       ` Will Deacon
  1 sibling, 1 reply; 17+ messages in thread
From: Lucas Stach @ 2018-06-29 17:14 UTC (permalink / raw)
  To: linux-arm-kernel

Am Freitag, den 29.06.2018, 17:22 +0100 schrieb Will Deacon:
> Hi all,
> 
> On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Jun 29, 2018 at 02:28:10PM +0200, Lucas Stach wrote:
> > > Oleksij was hunting a memory corruption issue on SocFPGA for the last
> > > few days. While we still don't have a clear picture about what's going
> > > wrong, I was reading some driver code and stumbled across the use of
> > > dma_wmb, which left me with some questions about the intended use
> > > and/or implementation. I'll try to describe by train of thought below.
> > 
> > I don't have a full understanding of this (partly because of
> > implementation details, and I don't know the SocFPGA implementation.)
> > 
> > The dma_*() barriers were proposed by Alexander Duyck in 2014, and the
> > commit explicitly states that dma_wmb() will be outer-shareable only.
> > 
> > I think what constitutes "outer-shareable" in terms of DMB is also
> > something of a debate - the ARM ARM is vague on that subject, basically
> > saying its implementation dependent:
> > 
> > ? In a VMSA implementation, for Shareable Normal memory, whether
> > ? there is a distinction between Inner Shareable and Outer Shareable
> > ? is IMPLEMENTATION DEFINED.
> > 
> > In the case of the three levels of shareability being implemented, a
> > "dmb ish" only reaches as far as the inner domain, "dmb osh" reaches
> > both the inner and outer domains, and "dmb sy" reaches outside the
> > outer domain.??Where there's no distinction between inner and outer,
> > it seems that both "dmb ish*" and "dmb osh*" will both have the same
> > reach.
> > 
> > That would mean that "dmb osh*" would not be visible outside of the
> > shareability domain, which surely is where most non-coherent DMA
> > masters in our SoCs lie, so on the face of it, it seems that it is
> > the wrong reach for this barrier.
> > 
> > However, it obviously works for most systems out there, which suggests
> > that "dmb osh*" reaches beyond "dmb ish*", but that seems to go against
> > the ARM ARM.
> > 
> > All our memory in Linux is mapped with NOS=1, which basically means
> > all shareable memory in the system is only inner shareable, and
> > nothing is shareable outside of the inner domain.??Does it make
> > sense to use "dmb osh*" if we have no outer-shareable memory?
> > 
> > Final bit of the puzzle is about which domain the PL310 ends up in.
> > It will come as no surprise that's also "implementation defined":
> > 
> > ? "The relationship between these conceptual levels of cache and the
> > ???implemented physical levels of cache is IMPLEMENTATION DEFINED,
> > ???and can differ from the boundaries between the Inner and Outer
> > ???Shareability domains."
> > 
> > were "conceptual levels" is talking about the inner/outer cacheability
> > level.
> > 
> > The only thing we can be sure about is that the L1 caches are in the
> > inner cacheability and inner shareability domains.??Everything else
> > is... implementation defined.
> > 
> > So, it seems to be possible that a L2 cache, such as the PL310, could
> > be in the inner cacheability domain and the outer shareability domain,
> > or vice versa!
> > 
> > "Implementation defined" is a nightmare when it comes to generic OSes
> > that have to work across multiple different implementations.
> > 
> > Maybe Will can shed some light on this topic.
> 
> You're right that cacheability and shareability are different things. For
> the purposes of ordering and coherence, we care about shareability. Normal
> non-cacheable is outer-shareable (which is a superset of inner-shareable),
> so DMB OSH is sufficient to order accesses to that buffer from the
> perspective of all observers.

Can you point me to something where this mapping from normal, non-
cacheable to outer-sharable is defined? Why is there a distinction
between OSH and SY in the barriers, if all non-coherent masters are in
the outer sharable domain?

/me is still confused

>  Since we use Normal non-cacheable mappings
> for non-coherent DMA buffers (and there is no such thing as a
> system-shareable memory type), then these barriers are sufficient to provide
> ordering in this case too.
> 
> Ideally, we'd limit the DMA barriers to the inner-shareable domain when
> dealing with coherent devices, but there's no way to determine that from
> the barrier macros.
> 
> Lucas -- did changing the shareability of the DMA barriers actually solve
> your problem?

There isn't enough data at this point to conclusively answer this. But
reading the code, this struck me as being odd enough to warrant some
clarifications from experts.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 16:48     ` Russell King - ARM Linux
@ 2018-06-29 17:43       ` Will Deacon
  2018-06-29 18:01         ` Russell King - ARM Linux
  2018-07-02 13:49         ` Lucas Stach
  0 siblings, 2 replies; 17+ messages in thread
From: Will Deacon @ 2018-06-29 17:43 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On Fri, Jun 29, 2018 at 05:48:01PM +0100, Russell King - ARM Linux wrote:
> On Fri, Jun 29, 2018 at 05:22:48PM +0100, Will Deacon wrote:
> > On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > > Maybe Will can shed some light on this topic.
> > 
> > You're right that cacheability and shareability are different things. For
> > the purposes of ordering and coherence, we care about shareability. Normal
> > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> 
> What does it mean when the "implementation" doesn't define two shareable
> domains - does it mean that inner and outer shareable are combined into
> just one "shareable" domain, or is outer shareable always treated as
> "everything but inner shareable"?

If there is only one shareability domain, then the inner and outer shareable
domains refer to that domain (i.e. they're the same).

> From the paragraph I quoted from the ARM ARM, it seems that the former
> applies, which should also mean that "dmb ish*" and "dmb osh*" are
> functionally equivalent, and only touch the inner shareable domain.

In this case, yes. In practice, there is usually one inner-shareable domain
which contains the CPUs and coherent DMA devices, and there is one
outer-shareable domain containing that inner-shareable domain, where
non-coherent DMA lives only in the outer-shareable domain. I don't know
of any systems where that isn't the case, and I'm not sure that our
interconnects even permit building anything else (I'd need to check).

> You also seem to be saying that PRRR.NOSn is ignored for any mapping
> that indicates non-cacheable normal memory - I've not found that stated
> in the ARM ARM.  It does say that NOSn doesn't apply if the region is
> mapped as strongly-ordered, but that isn't the case for our DMA
> coherent mappings.

Ah, it looks like it's not the case for short-descriptor without TEX remap,
where the S bit determines the shareability (i.e. shared or non-shared).
However, in Armv8 or short-descriptor using TEX remapping then non-cacheable
mappings are outer-shareable:

G4.7.3	Short-descriptor format memory region attributes, with TEX remap

If the TEX[0], C and B bits determine that the region is a Device memory
type, or is Normal Inner Non-cacheable, Outer Non-cacheable, then the
region is Outer Shareable.

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 17:14     ` Lucas Stach
@ 2018-06-29 17:46       ` Will Deacon
  2018-07-02  9:58         ` Lucas Stach
  0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2018-06-29 17:46 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 29, 2018 at 07:14:50PM +0200, Lucas Stach wrote:
> Am Freitag, den 29.06.2018, 17:22 +0100 schrieb Will Deacon:
> > You're right that cacheability and shareability are different things. For
> > the purposes of ordering and coherence, we care about shareability. Normal
> > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> > so DMB OSH is sufficient to order accesses to that buffer from the
> > perspective of all observers.
> 
> Can you point me to something where this mapping from normal, non-
> cacheable to outer-sharable is defined? Why is there a distinction
> between OSH and SY in the barriers, if all non-coherent masters are in
> the outer sharable domain?
> 
> /me is still confused

Hopefully my reply to Russell will help you here. The distinction between
OSH and SY is for handling masters that are not coherent at all, even when
using non-cacheable mappings. But as I mentioned to Russell, I think this
is largely theoretical and nobody builds systems like this.

> >  Since we use Normal non-cacheable mappings
> > for non-coherent DMA buffers (and there is no such thing as a
> > system-shareable memory type), then these barriers are sufficient to provide
> > ordering in this case too.
> > 
> > Ideally, we'd limit the DMA barriers to the inner-shareable domain when
> > dealing with coherent devices, but there's no way to determine that from
> > the barrier macros.
> > 
> > Lucas -- did changing the shareability of the DMA barriers actually solve
> > your problem?
> 
> There isn't enough data at this point to conclusively answer this. But
> reading the code, this struck me as being odd enough to warrant some
> clarifications from experts.

Have you checked that you have bit 22 set in the PL310 aux ctrl register?

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 17:43       ` Will Deacon
@ 2018-06-29 18:01         ` Russell King - ARM Linux
  2018-07-02 13:49         ` Lucas Stach
  1 sibling, 0 replies; 17+ messages in thread
From: Russell King - ARM Linux @ 2018-06-29 18:01 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Jun 29, 2018 at 06:43:22PM +0100, Will Deacon wrote:
> Hi Russell,
> 
> On Fri, Jun 29, 2018 at 05:48:01PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Jun 29, 2018 at 05:22:48PM +0100, Will Deacon wrote:
> > > On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > > > Maybe Will can shed some light on this topic.
> > > 
> > > You're right that cacheability and shareability are different things. For
> > > the purposes of ordering and coherence, we care about shareability. Normal
> > > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> > 
> > What does it mean when the "implementation" doesn't define two shareable
> > domains - does it mean that inner and outer shareable are combined into
> > just one "shareable" domain, or is outer shareable always treated as
> > "everything but inner shareable"?
> 
> If there is only one shareability domain, then the inner and outer shareable
> domains refer to that domain (i.e. they're the same).
> 
> > From the paragraph I quoted from the ARM ARM, it seems that the former
> > applies, which should also mean that "dmb ish*" and "dmb osh*" are
> > functionally equivalent, and only touch the inner shareable domain.
> 
> In this case, yes. In practice, there is usually one inner-shareable domain
> which contains the CPUs and coherent DMA devices, and there is one
> outer-shareable domain containing that inner-shareable domain, where
> non-coherent DMA lives only in the outer-shareable domain. I don't know
> of any systems where that isn't the case, and I'm not sure that our
> interconnects even permit building anything else (I'd need to check).
> 
> > You also seem to be saying that PRRR.NOSn is ignored for any mapping
> > that indicates non-cacheable normal memory - I've not found that stated
> > in the ARM ARM.  It does say that NOSn doesn't apply if the region is
> > mapped as strongly-ordered, but that isn't the case for our DMA
> > coherent mappings.
> 
> Ah, it looks like it's not the case for short-descriptor without TEX remap,
> where the S bit determines the shareability (i.e. shared or non-shared).
> However, in Armv8 or short-descriptor using TEX remapping then non-cacheable
> mappings are outer-shareable:
> 
> G4.7.3	Short-descriptor format memory region attributes, with TEX remap
> 
> If the TEX[0], C and B bits determine that the region is a Device memory
> type, or is Normal Inner Non-cacheable, Outer Non-cacheable, then the
> region is Outer Shareable.

I think this is a danger of using the "latest" ARM ARM - the fact this
comes from Appendix G tells me that you're quoting from the ARMv8 ARM
ARM, and it's probably the differences to ARMv7 appendix.

The ARMv7 ARM ARMs do not state this - in DDI0406C, the section you
refer to is:

B3.8.3   Short-descriptor format memory region attributes, with TEX remap

which makes no mention of this.  There is an explicit section on the NOS
bits:

"Interpretation of the NOSn fields in the PRRR, with TEX remap

When all of the following apply, the NOSn fields in the PRRR distinguish
between Inner Shareable and Outer Shareable memory regions:

?     the SCTLR.TRE bit is set to 1
?     the region is mapped as Normal memory, or the implementation does
      not include the Large Physical Address Extension and the region is
      mapped as Device memory
?     the Normal memory remapping or Device memory remapping of the S bit
      value for the entry makes the region Shareable
?     the implementation supports the distinction between Inner Shareable
      and Outer Shareable.

...
The values of the NOSn fields in the PRRR have no effect if any of the
following apply:

?     the SCTLR.TRE bit is set to 0 and the IMPLEMENTATION DEFINED
      mechanism has not been invoked
?     the region is mapped as Strongly-ordered memory
?     the implementation includes the Large Physical Address Extension,
      and the region is mapped as Device memory
?     the Normal memory remapping or Device memory remapping of the S bit
      value for the entry makes the region Non-shareable.

The NOSn fields in the PRRR are RAZ/WI if the implementation does not
support the distinction between Inner Shareable and Outer Shareable
memory regions."

This seems to be pretty explicit, and in disagreement with what you
quote from G4.7.3.

I'm more likely to believe the ARM ARM produced for the ARMv7 architecture
rather than a later version, since that's the document that would've been
used at the time for such implementations.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 17:46       ` Will Deacon
@ 2018-07-02  9:58         ` Lucas Stach
  0 siblings, 0 replies; 17+ messages in thread
From: Lucas Stach @ 2018-07-02  9:58 UTC (permalink / raw)
  To: linux-arm-kernel

Am Freitag, den 29.06.2018, 18:46 +0100 schrieb Will Deacon:
> On Fri, Jun 29, 2018 at 07:14:50PM +0200, Lucas Stach wrote:
> > Am Freitag, den 29.06.2018, 17:22 +0100 schrieb Will Deacon:
> > > You're right that cacheability and shareability are different things. For
> > > the purposes of ordering and coherence, we care about shareability. Normal
> > > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> > > so DMB OSH is sufficient to order accesses to that buffer from the
> > > perspective of all observers.
> > 
> > Can you point me to something where this mapping from normal, non-
> > cacheable to outer-sharable is defined? Why is there a distinction
> > between OSH and SY in the barriers, if all non-coherent masters are in
> > the outer sharable domain?
> > 
> > /me is still confused
> 
> Hopefully my reply to Russell will help you here. The distinction between
> OSH and SY is for handling masters that are not coherent at all, even when
> using non-cacheable mappings. But as I mentioned to Russell, I think this
> is largely theoretical and nobody builds systems like this.
> 
> > > ?Since we use Normal non-cacheable mappings
> > > for non-coherent DMA buffers (and there is no such thing as a
> > > system-shareable memory type), then these barriers are sufficient to provide
> > > ordering in this case too.
> > > 
> > > Ideally, we'd limit the DMA barriers to the inner-shareable domain when
> > > dealing with coherent devices, but there's no way to determine that from
> > > the barrier macros.
> > > 
> > > Lucas -- did changing the shareability of the DMA barriers actually solve
> > > your problem?
> > 
> > There isn't enough data at this point to conclusively answer this. But
> > reading the code, this struck me as being odd enough to warrant some
> > clarifications from experts.
> 
> Have you checked that you have bit 22 set in the PL310 aux ctrl register?

Yep, that was one of the first things we checked, as I know that not
setting this bit results in the PL310 being nonconformant to the ARM
ARM clarifications about mismatched page mappings.

Regards,
Lucas

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-06-29 17:43       ` Will Deacon
  2018-06-29 18:01         ` Russell King - ARM Linux
@ 2018-07-02 13:49         ` Lucas Stach
  2018-07-02 17:45           ` Will Deacon
  1 sibling, 1 reply; 17+ messages in thread
From: Lucas Stach @ 2018-07-02 13:49 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Am Freitag, den 29.06.2018, 18:43 +0100 schrieb Will Deacon:
> Hi Russell,
> 
> On Fri, Jun 29, 2018 at 05:48:01PM +0100, Russell King - ARM Linux wrote:
> > On Fri, Jun 29, 2018 at 05:22:48PM +0100, Will Deacon wrote:
> > > On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > > > Maybe Will can shed some light on this topic.
> > > 
> > > You're right that cacheability and shareability are different things. For
> > > the purposes of ordering and coherence, we care about shareability. Normal
> > > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> > 
> > What does it mean when the "implementation" doesn't define two shareable
> > domains - does it mean that inner and outer shareable are combined into
> > just one "shareable" domain, or is outer shareable always treated as
> > "everything but inner shareable"?
> 
> If there is only one shareability domain, then the inner and outer shareable
> domains refer to that domain (i.e. they're the same).
> 
> > From the paragraph I quoted from the ARM ARM, it seems that the former
> > applies, which should also mean that "dmb ish*" and "dmb osh*" are
> > functionally equivalent, and only touch the inner shareable domain.
> 
> In this case, yes. In practice, there is usually one inner-shareable domain
> which contains the CPUs and coherent DMA devices, and there is one
> outer-shareable domain containing that inner-shareable domain, where
> non-coherent DMA lives only in the outer-shareable domain. I don't know
> of any systems where that isn't the case, and I'm not sure that our
> interconnects even permit building anything else (I'd need to check).
> 
Sorry for not citing from the ARM ARM, but [1] seems to partly
contradict what you said above. Especially the table detailing the
shareability domains and the related picture seem to clash with the
above statement that most systems just have an outer-shareable domain
covering the whole system.

Also the section "External caches", "Before AMBA4 ACE" seems to suggest
that the barriers aren't fully propagated to the PL310 write-caches and
master ports. I'm unsure if this is just an artifact of the mentioned
MMIO access, so handling of normal vs. device memory transactions in
the PL310.

Regards,
Lucas

[1] https://community.arm.com/processors/b/blog/posts/memory-access-ordering-part-3---memory-access-ordering-in-the-arm-architecture

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-07-02 13:49         ` Lucas Stach
@ 2018-07-02 17:45           ` Will Deacon
  2018-07-06 12:26             ` Will Deacon
  0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2018-07-02 17:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> Am Freitag, den 29.06.2018, 18:43 +0100 schrieb Will Deacon:
> > On Fri, Jun 29, 2018 at 05:48:01PM +0100, Russell King - ARM Linux wrote:
> > > On Fri, Jun 29, 2018 at 05:22:48PM +0100, Will Deacon wrote:
> > > > On Fri, Jun 29, 2018 at 03:25:39PM +0100, Russell King - ARM Linux wrote:
> > > > > Maybe Will can shed some light on this topic.
> > > > 
> > > > You're right that cacheability and shareability are different things. For
> > > > the purposes of ordering and coherence, we care about shareability. Normal
> > > > non-cacheable is outer-shareable (which is a superset of inner-shareable),
> > > 
> > > What does it mean when the "implementation" doesn't define two shareable
> > > domains - does it mean that inner and outer shareable are combined into
> > > just one "shareable" domain, or is outer shareable always treated as
> > > "everything but inner shareable"?
> > 
> > If there is only one shareability domain, then the inner and outer shareable
> > domains refer to that domain (i.e. they're the same).
> > 
> > > From the paragraph I quoted from the ARM ARM, it seems that the former
> > > applies, which should also mean that "dmb ish*" and "dmb osh*" are
> > > functionally equivalent, and only touch the inner shareable domain.
> > 
> > In this case, yes. In practice, there is usually one inner-shareable domain
> > which contains the CPUs and coherent DMA devices, and there is one
> > outer-shareable domain containing that inner-shareable domain, where
> > non-coherent DMA lives only in the outer-shareable domain. I don't know
> > of any systems where that isn't the case, and I'm not sure that our
> > interconnects even permit building anything else (I'd need to check).
> > 
> Sorry for not citing from the ARM ARM, but [1] seems to partly
> contradict what you said above. Especially the table detailing the
> shareability domains and the related picture seem to clash with the
> above statement that most systems just have an outer-shareable domain
> covering the whole system.

I don't think the picture in that article is representative of a real SoC
design, but I'm pretty much just relaying the feedback from the architecture
team here.

> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> that the barriers aren't fully propagated to the PL310 write-caches and
> master ports. I'm unsure if this is just an artifact of the mentioned
> MMIO access, so handling of normal vs. device memory transactions in
> the PL310.

I'll check with the hardware folks about the PL310, but I don't see how
you could propagate barriers over AXI3 anyway because I don't think this
stuff existed back then.

Will

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-07-02 17:45           ` Will Deacon
@ 2018-07-06 12:26             ` Will Deacon
  2018-07-09  6:20               ` Oleksij Rempel
  2018-07-09  9:45               ` Lucas Stach
  0 siblings, 2 replies; 17+ messages in thread
From: Will Deacon @ 2018-07-06 12:26 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Lucas,

On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > that the barriers aren't fully propagated to the PL310 write-caches and
> > master ports. I'm unsure if this is just an artifact of the mentioned
> > MMIO access, so handling of normal vs. device memory transactions in
> > the PL310.
> 
> I'll check with the hardware folks about the PL310, but I don't see how
> you could propagate barriers over AXI3 anyway because I don't think this
> stuff existed back then.

Damn. As I feared, you can't propagate the barrier to the PL310, so the
A9 is unable to enforce ordering beyond the responses it gets back. The
PL310 will then re-order normal memory accesses (including non-cacheable)
in its store buffer, so we do actually need a bloody CACHE SYNC here.

Patch below. Does it help?

Will

--->8

>From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
From: Will Deacon <will.deacon@arm.com>
Date: Wed, 4 Jul 2018 10:56:22 +0100
Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()

Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
transactions or shareability domains and was therefore unable to enforce
ordering between memory accesses either side of a barrier instruction,
requiring the CPU to handle all of the ordering itself.

Unfortunately, IP such as the venerable PL310 will re-order stores to
Normal memory inside its store-buffer, meaning that accesses to a coherent
DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
a CACHE_SYNC command between buffering the accesses. Whilst this is already
the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
to sequences such as:

	STR	Rdata, [Rbuf, #DATA_OFFSET]
	DMB	OSHST
	STR	Rvalid, [Rbuf, #VALID_OFFSET]

being reordered such that the DMA device sees the write of valid but not
the write of data.

This patch ensures that the SoC fabric and the outer-cache are kicked
by dma_wmb() if they are able to buffer transactions outside of the CPU
but before the Point of Coherency.

Reported-by: Lucas Stach <l.stach@pengutronix.de>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 arch/arm/include/asm/barrier.h    | 14 ++++++++++----
 arch/arm/include/asm/outercache.h |  6 ++++++
 arch/arm/mm/flush.c               | 24 +++++++++++++++++++++++-
 3 files changed, 39 insertions(+), 5 deletions(-)

diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
index 69772e742a0a..e31833f72b69 100644
--- a/arch/arm/include/asm/barrier.h
+++ b/arch/arm/include/asm/barrier.h
@@ -53,17 +53,23 @@
 #ifdef CONFIG_ARM_HEAVY_MB
 extern void (*soc_mb)(void);
 extern void arm_heavy_mb(void);
-#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
+extern void arm_heavy_wmb(void);
+extern void arm_heavy_dma_wmb(void);
+#define __arm_heavy_mb()	arm_heavy_mb()
+#define __arm_heavy_wmb()	arm_heavy_wmb()
+#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
 #else
-#define __arm_heavy_mb(x...) dsb(x)
+#define __arm_heavy_mb()	dsb()
+#define __arm_heavy_wmb()	dsb(st)
+#define __arm_heavy_dma_wmb()	dmb(oshst)
 #endif
 
 #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
 #define mb()		__arm_heavy_mb()
 #define rmb()		dsb()
-#define wmb()		__arm_heavy_mb(st)
+#define wmb()		__arm_heavy_wmb()
 #define dma_rmb()	dmb(osh)
-#define dma_wmb()	dmb(oshst)
+#define dma_wmb()	__arm_heavy_dma_wmb()
 #else
 #define mb()		barrier()
 #define rmb()		barrier()
diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
index c2bf24f40177..e744efe6e2a7 100644
--- a/arch/arm/include/asm/outercache.h
+++ b/arch/arm/include/asm/outercache.h
@@ -43,6 +43,12 @@ struct outer_cache_fns {
 
 extern struct outer_cache_fns outer_cache;
 
+#ifdef CONFIG_OUTER_CACHE_SYNC
+static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
+#else
+static inline bool outer_cache_has_sync(void) { return 0; }
+#endif
+
 #ifdef CONFIG_OUTER_CACHE
 /**
  * outer_inv_range - invalidate range of outer cache lines
diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
index 58469623b015..1252689362d8 100644
--- a/arch/arm/mm/flush.c
+++ b/arch/arm/mm/flush.c
@@ -24,7 +24,7 @@
 #ifdef CONFIG_ARM_HEAVY_MB
 void (*soc_mb)(void);
 
-void arm_heavy_mb(void)
+static void __arm_heavy_sync(void)
 {
 #ifdef CONFIG_OUTER_CACHE_SYNC
 	if (outer_cache.sync)
@@ -33,7 +33,29 @@ void arm_heavy_mb(void)
 	if (soc_mb)
 		soc_mb();
 }
+
+void arm_heavy_mb(void)
+{
+	dsb();
+	__arm_heavy_sync();
+}
 EXPORT_SYMBOL(arm_heavy_mb);
+
+void arm_heavy_wmb(void)
+{
+	dsb(st);
+	__arm_heavy_sync();
+}
+EXPORT_SYMBOL(arm_heavy_wmb);
+
+void arm_heavy_dma_wmb(void)
+{
+	if (outer_cache_has_sync() || soc_mb)
+		arm_heavy_wmb();
+	else
+		dmb(oshst);
+}
+EXPORT_SYMBOL(arm_heavy_dma_wmb);
 #endif
 
 #ifdef CONFIG_CPU_CACHE_VIPT
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-07-06 12:26             ` Will Deacon
@ 2018-07-09  6:20               ` Oleksij Rempel
  2018-09-13 13:17                 ` Will Deacon
  2018-07-09  9:45               ` Lucas Stach
  1 sibling, 1 reply; 17+ messages in thread
From: Oleksij Rempel @ 2018-07-09  6:20 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On 06.07.2018 14:26, Will Deacon wrote:
> Hi Lucas,
> 
> On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
>> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
>>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
>>> that the barriers aren't fully propagated to the PL310 write-caches and
>>> master ports. I'm unsure if this is just an artifact of the mentioned
>>> MMIO access, so handling of normal vs. device memory transactions in
>>> the PL310.
>>
>> I'll check with the hardware folks about the PL310, but I don't see how
>> you could propagate barriers over AXI3 anyway because I don't think this
>> stuff existed back then.
> 
> Damn. As I feared, you can't propagate the barrier to the PL310, so the
> A9 is unable to enforce ordering beyond the responses it gets back. The
> PL310 will then re-order normal memory accesses (including non-cacheable)
> in its store buffer, so we do actually need a bloody CACHE SYNC here.
> 
> Patch below. Does it help?

Suddenly no.
Is it possible that I run in to this issue:
764369: Data or Unified Cache Line Maintenance by MVA Fails on
Inner-Shareable Memory
"...
Note that even when all three components of the workaround are in place,
this erratum might still occur. However, this occurrence would require
some extremely rare and complex timing conditions, so that the
probability of reaching the point of failure is extremely low. This low
probability, along with the fact that this erratum requires an uncommon
software scenario, explains why this workaround is likely to be a
reliable practical solution for most systems.
To ARM's knowledge, no failure has been observed in any system when all
three components of this workaround have been implemented.
For critical systems that cannot cope with the extremely low failure
risks associated with the above workaround, a second workaround is
possible which involves changing the mapping of the data being accessed
so that it is in a non-cacheable area. This ensures that the written
data remains uncached, which means it is always visible to non-coherent
agents in the system, or to the instruction side in the case of
self-modifying code, without any need for cache maintenance operation."


> Will
> 
> --->8
> 
> From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
> From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 4 Jul 2018 10:56:22 +0100
> Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()
> 
> Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
> transactions or shareability domains and was therefore unable to enforce
> ordering between memory accesses either side of a barrier instruction,
> requiring the CPU to handle all of the ordering itself.
> 
> Unfortunately, IP such as the venerable PL310 will re-order stores to
> Normal memory inside its store-buffer, meaning that accesses to a coherent
> DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
> out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
> a CACHE_SYNC command between buffering the accesses. Whilst this is already
> the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
> to sequences such as:
> 
> 	STR	Rdata, [Rbuf, #DATA_OFFSET]
> 	DMB	OSHST
> 	STR	Rvalid, [Rbuf, #VALID_OFFSET]
> 
> being reordered such that the DMA device sees the write of valid but not
> the write of data.
> 
> This patch ensures that the SoC fabric and the outer-cache are kicked
> by dma_wmb() if they are able to buffer transactions outside of the CPU
> but before the Point of Coherency.
> 
> Reported-by: Lucas Stach <l.stach@pengutronix.de>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  arch/arm/include/asm/barrier.h    | 14 ++++++++++----
>  arch/arm/include/asm/outercache.h |  6 ++++++
>  arch/arm/mm/flush.c               | 24 +++++++++++++++++++++++-
>  3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 69772e742a0a..e31833f72b69 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -53,17 +53,23 @@
>  #ifdef CONFIG_ARM_HEAVY_MB
>  extern void (*soc_mb)(void);
>  extern void arm_heavy_mb(void);
> -#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
> +extern void arm_heavy_wmb(void);
> +extern void arm_heavy_dma_wmb(void);
> +#define __arm_heavy_mb()	arm_heavy_mb()
> +#define __arm_heavy_wmb()	arm_heavy_wmb()
> +#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
>  #else
> -#define __arm_heavy_mb(x...) dsb(x)
> +#define __arm_heavy_mb()	dsb()
> +#define __arm_heavy_wmb()	dsb(st)
> +#define __arm_heavy_dma_wmb()	dmb(oshst)
>  #endif
>  
>  #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
>  #define mb()		__arm_heavy_mb()
>  #define rmb()		dsb()
> -#define wmb()		__arm_heavy_mb(st)
> +#define wmb()		__arm_heavy_wmb()
>  #define dma_rmb()	dmb(osh)
> -#define dma_wmb()	dmb(oshst)
> +#define dma_wmb()	__arm_heavy_dma_wmb()
>  #else
>  #define mb()		barrier()
>  #define rmb()		barrier()
> diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
> index c2bf24f40177..e744efe6e2a7 100644
> --- a/arch/arm/include/asm/outercache.h
> +++ b/arch/arm/include/asm/outercache.h
> @@ -43,6 +43,12 @@ struct outer_cache_fns {
>  
>  extern struct outer_cache_fns outer_cache;
>  
> +#ifdef CONFIG_OUTER_CACHE_SYNC
> +static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
> +#else
> +static inline bool outer_cache_has_sync(void) { return 0; }
> +#endif
> +
>  #ifdef CONFIG_OUTER_CACHE
>  /**
>   * outer_inv_range - invalidate range of outer cache lines
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index 58469623b015..1252689362d8 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -24,7 +24,7 @@
>  #ifdef CONFIG_ARM_HEAVY_MB
>  void (*soc_mb)(void);
>  
> -void arm_heavy_mb(void)
> +static void __arm_heavy_sync(void)
>  {
>  #ifdef CONFIG_OUTER_CACHE_SYNC
>  	if (outer_cache.sync)
> @@ -33,7 +33,29 @@ void arm_heavy_mb(void)
>  	if (soc_mb)
>  		soc_mb();
>  }
> +
> +void arm_heavy_mb(void)
> +{
> +	dsb();
> +	__arm_heavy_sync();
> +}
>  EXPORT_SYMBOL(arm_heavy_mb);
> +
> +void arm_heavy_wmb(void)
> +{
> +	dsb(st);
> +	__arm_heavy_sync();
> +}
> +EXPORT_SYMBOL(arm_heavy_wmb);
> +
> +void arm_heavy_dma_wmb(void)
> +{
> +	if (outer_cache_has_sync() || soc_mb)
> +		arm_heavy_wmb();
> +	else
> +		dmb(oshst);
> +}
> +EXPORT_SYMBOL(arm_heavy_dma_wmb);
>  #endif
>  
>  #ifdef CONFIG_CPU_CACHE_VIPT
> 

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: OpenPGP digital signature
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20180709/ba850d25/attachment.sig>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-07-06 12:26             ` Will Deacon
  2018-07-09  6:20               ` Oleksij Rempel
@ 2018-07-09  9:45               ` Lucas Stach
  1 sibling, 0 replies; 17+ messages in thread
From: Lucas Stach @ 2018-07-09  9:45 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

Am Freitag, den 06.07.2018, 13:26 +0100 schrieb Will Deacon:
> Hi Lucas,
> 
> On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> > On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > > Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > > that the barriers aren't fully propagated to the PL310 write-caches and
> > > master ports. I'm unsure if this is just an artifact of the mentioned
> > > MMIO access, so handling of normal vs. device memory transactions in
> > > the PL310.
> > 
> > I'll check with the hardware folks about the PL310, but I don't see how
> > you could propagate barriers over AXI3 anyway because I don't think this
> > stuff existed back then.
> 
> Damn. As I feared, you can't propagate the barrier to the PL310, so the
> A9 is unable to enforce ordering beyond the responses it gets back. The
> PL310 will then re-order normal memory accesses (including non-cacheable)
> in its store buffer, so we do actually need a bloody CACHE SYNC here.
> 
> Patch below. Does it help?

It seems the issues we see on the SocFPGA aren't caused by insufficient
barriers, but we are still in the process of tracking down the root
cause.

Aside from this, the patch below seem to match my understanding of the
outer cache issue. I'll give this a spin on i.MX6, as this would
potentially allow us to switch to the dma_* barrier variants in a
number drivers, reducing overhead on systems without an outer cache.

Thanks,
Lucas

> --->8
> 
> From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
> > From: Will Deacon <will.deacon@arm.com>
> Date: Wed, 4 Jul 2018 10:56:22 +0100
> Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()
> 
> Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
> transactions or shareability domains and was therefore unable to enforce
> ordering between memory accesses either side of a barrier instruction,
> requiring the CPU to handle all of the ordering itself.
> 
> Unfortunately, IP such as the venerable PL310 will re-order stores to
> Normal memory inside its store-buffer, meaning that accesses to a coherent
> DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
> out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
> a CACHE_SYNC command between buffering the accesses. Whilst this is already
> the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
> to sequences such as:
> 
> > 	STR	Rdata, [Rbuf, #DATA_OFFSET]
> > 	DMB	OSHST
> > 	STR	Rvalid, [Rbuf, #VALID_OFFSET]
> 
> being reordered such that the DMA device sees the write of valid but not
> the write of data.
> 
> This patch ensures that the SoC fabric and the outer-cache are kicked
> by dma_wmb() if they are able to buffer transactions outside of the CPU
> but before the Point of Coherency.
> 
> > Reported-by: Lucas Stach <l.stach@pengutronix.de>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
> ?arch/arm/include/asm/barrier.h????| 14 ++++++++++----
> ?arch/arm/include/asm/outercache.h |??6 ++++++
> ?arch/arm/mm/flush.c???????????????| 24 +++++++++++++++++++++++-
> ?3 files changed, 39 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> index 69772e742a0a..e31833f72b69 100644
> --- a/arch/arm/include/asm/barrier.h
> +++ b/arch/arm/include/asm/barrier.h
> @@ -53,17 +53,23 @@
> ?#ifdef CONFIG_ARM_HEAVY_MB
> ?extern void (*soc_mb)(void);
> ?extern void arm_heavy_mb(void);
> -#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
> +extern void arm_heavy_wmb(void);
> +extern void arm_heavy_dma_wmb(void);
> > +#define __arm_heavy_mb()	arm_heavy_mb()
> > +#define __arm_heavy_wmb()	arm_heavy_wmb()
> > +#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
> ?#else
> -#define __arm_heavy_mb(x...) dsb(x)
> > +#define __arm_heavy_mb()	dsb()
> > +#define __arm_heavy_wmb()	dsb(st)
> > +#define __arm_heavy_dma_wmb()	dmb(oshst)
> ?#endif
> ?
> ?#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
> > ?#define mb()		__arm_heavy_mb()
> > ?#define rmb()		dsb()
> > -#define wmb()		__arm_heavy_mb(st)
> > +#define wmb()		__arm_heavy_wmb()
> > ?#define dma_rmb()	dmb(osh)
> > -#define dma_wmb()	dmb(oshst)
> > +#define dma_wmb()	__arm_heavy_dma_wmb()
> ?#else
> > ?#define mb()		barrier()
> > ?#define rmb()		barrier()
> diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
> index c2bf24f40177..e744efe6e2a7 100644
> --- a/arch/arm/include/asm/outercache.h
> +++ b/arch/arm/include/asm/outercache.h
> @@ -43,6 +43,12 @@ struct outer_cache_fns {
> ?
> ?extern struct outer_cache_fns outer_cache;
> ?
> +#ifdef CONFIG_OUTER_CACHE_SYNC
> +static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
> +#else
> +static inline bool outer_cache_has_sync(void) { return 0; }
> +#endif
> +
> ?#ifdef CONFIG_OUTER_CACHE
> ?/**
> ? * outer_inv_range - invalidate range of outer cache lines
> diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> index 58469623b015..1252689362d8 100644
> --- a/arch/arm/mm/flush.c
> +++ b/arch/arm/mm/flush.c
> @@ -24,7 +24,7 @@
> ?#ifdef CONFIG_ARM_HEAVY_MB
> ?void (*soc_mb)(void);
> ?
> -void arm_heavy_mb(void)
> +static void __arm_heavy_sync(void)
> ?{
> ?#ifdef CONFIG_OUTER_CACHE_SYNC
> > ?	if (outer_cache.sync)
> @@ -33,7 +33,29 @@ void arm_heavy_mb(void)
> > ?	if (soc_mb)
> > ?		soc_mb();
> ?}
> +
> +void arm_heavy_mb(void)
> +{
> > +	dsb();
> > +	__arm_heavy_sync();
> +}
> ?EXPORT_SYMBOL(arm_heavy_mb);
> +
> +void arm_heavy_wmb(void)
> +{
> > +	dsb(st);
> > +	__arm_heavy_sync();
> +}
> +EXPORT_SYMBOL(arm_heavy_wmb);
> +
> +void arm_heavy_dma_wmb(void)
> +{
> > +	if (outer_cache_has_sync() || soc_mb)
> > +		arm_heavy_wmb();
> > +	else
> > +		dmb(oshst);
> +}
> +EXPORT_SYMBOL(arm_heavy_dma_wmb);
> ?#endif
> ?
> ?#ifdef CONFIG_CPU_CACHE_VIPT

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-07-09  6:20               ` Oleksij Rempel
@ 2018-09-13 13:17                 ` Will Deacon
  2018-09-13 14:09                   ` Oleksij Rempel
  0 siblings, 1 reply; 17+ messages in thread
From: Will Deacon @ 2018-09-13 13:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Jul 09, 2018 at 08:20:22AM +0200, Oleksij Rempel wrote:
> On 06.07.2018 14:26, Will Deacon wrote:
> > On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> >> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> >>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> >>> that the barriers aren't fully propagated to the PL310 write-caches and
> >>> master ports. I'm unsure if this is just an artifact of the mentioned
> >>> MMIO access, so handling of normal vs. device memory transactions in
> >>> the PL310.
> >>
> >> I'll check with the hardware folks about the PL310, but I don't see how
> >> you could propagate barriers over AXI3 anyway because I don't think this
> >> stuff existed back then.
> > 
> > Damn. As I feared, you can't propagate the barrier to the PL310, so the
> > A9 is unable to enforce ordering beyond the responses it gets back. The
> > PL310 will then re-order normal memory accesses (including non-cacheable)
> > in its store buffer, so we do actually need a bloody CACHE SYNC here.
> > 
> > Patch below. Does it help?
> 
> Suddenly no.

I just noticed that I'm still carrying this patch in my tree. Are you saying
it doesn't help on your system? It all went quiet, so I assume you fixed
this issue one way or the other?

Will

> > From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
> > From: Will Deacon <will.deacon@arm.com>
> > Date: Wed, 4 Jul 2018 10:56:22 +0100
> > Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()
> > 
> > Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
> > transactions or shareability domains and was therefore unable to enforce
> > ordering between memory accesses either side of a barrier instruction,
> > requiring the CPU to handle all of the ordering itself.
> > 
> > Unfortunately, IP such as the venerable PL310 will re-order stores to
> > Normal memory inside its store-buffer, meaning that accesses to a coherent
> > DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
> > out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
> > a CACHE_SYNC command between buffering the accesses. Whilst this is already
> > the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
> > to sequences such as:
> > 
> > 	STR	Rdata, [Rbuf, #DATA_OFFSET]
> > 	DMB	OSHST
> > 	STR	Rvalid, [Rbuf, #VALID_OFFSET]
> > 
> > being reordered such that the DMA device sees the write of valid but not
> > the write of data.
> > 
> > This patch ensures that the SoC fabric and the outer-cache are kicked
> > by dma_wmb() if they are able to buffer transactions outside of the CPU
> > but before the Point of Coherency.
> > 
> > Reported-by: Lucas Stach <l.stach@pengutronix.de>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  arch/arm/include/asm/barrier.h    | 14 ++++++++++----
> >  arch/arm/include/asm/outercache.h |  6 ++++++
> >  arch/arm/mm/flush.c               | 24 +++++++++++++++++++++++-
> >  3 files changed, 39 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > index 69772e742a0a..e31833f72b69 100644
> > --- a/arch/arm/include/asm/barrier.h
> > +++ b/arch/arm/include/asm/barrier.h
> > @@ -53,17 +53,23 @@
> >  #ifdef CONFIG_ARM_HEAVY_MB
> >  extern void (*soc_mb)(void);
> >  extern void arm_heavy_mb(void);
> > -#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
> > +extern void arm_heavy_wmb(void);
> > +extern void arm_heavy_dma_wmb(void);
> > +#define __arm_heavy_mb()	arm_heavy_mb()
> > +#define __arm_heavy_wmb()	arm_heavy_wmb()
> > +#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
> >  #else
> > -#define __arm_heavy_mb(x...) dsb(x)
> > +#define __arm_heavy_mb()	dsb()
> > +#define __arm_heavy_wmb()	dsb(st)
> > +#define __arm_heavy_dma_wmb()	dmb(oshst)
> >  #endif
> >  
> >  #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
> >  #define mb()		__arm_heavy_mb()
> >  #define rmb()		dsb()
> > -#define wmb()		__arm_heavy_mb(st)
> > +#define wmb()		__arm_heavy_wmb()
> >  #define dma_rmb()	dmb(osh)
> > -#define dma_wmb()	dmb(oshst)
> > +#define dma_wmb()	__arm_heavy_dma_wmb()
> >  #else
> >  #define mb()		barrier()
> >  #define rmb()		barrier()
> > diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
> > index c2bf24f40177..e744efe6e2a7 100644
> > --- a/arch/arm/include/asm/outercache.h
> > +++ b/arch/arm/include/asm/outercache.h
> > @@ -43,6 +43,12 @@ struct outer_cache_fns {
> >  
> >  extern struct outer_cache_fns outer_cache;
> >  
> > +#ifdef CONFIG_OUTER_CACHE_SYNC
> > +static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
> > +#else
> > +static inline bool outer_cache_has_sync(void) { return 0; }
> > +#endif
> > +
> >  #ifdef CONFIG_OUTER_CACHE
> >  /**
> >   * outer_inv_range - invalidate range of outer cache lines
> > diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> > index 58469623b015..1252689362d8 100644
> > --- a/arch/arm/mm/flush.c
> > +++ b/arch/arm/mm/flush.c
> > @@ -24,7 +24,7 @@
> >  #ifdef CONFIG_ARM_HEAVY_MB
> >  void (*soc_mb)(void);
> >  
> > -void arm_heavy_mb(void)
> > +static void __arm_heavy_sync(void)
> >  {
> >  #ifdef CONFIG_OUTER_CACHE_SYNC
> >  	if (outer_cache.sync)
> > @@ -33,7 +33,29 @@ void arm_heavy_mb(void)
> >  	if (soc_mb)
> >  		soc_mb();
> >  }
> > +
> > +void arm_heavy_mb(void)
> > +{
> > +	dsb();
> > +	__arm_heavy_sync();
> > +}
> >  EXPORT_SYMBOL(arm_heavy_mb);
> > +
> > +void arm_heavy_wmb(void)
> > +{
> > +	dsb(st);
> > +	__arm_heavy_sync();
> > +}
> > +EXPORT_SYMBOL(arm_heavy_wmb);
> > +
> > +void arm_heavy_dma_wmb(void)
> > +{
> > +	if (outer_cache_has_sync() || soc_mb)
> > +		arm_heavy_wmb();
> > +	else
> > +		dmb(oshst);
> > +}
> > +EXPORT_SYMBOL(arm_heavy_dma_wmb);
> >  #endif
> >  
> >  #ifdef CONFIG_CPU_CACHE_VIPT
> > 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches
  2018-09-13 13:17                 ` Will Deacon
@ 2018-09-13 14:09                   ` Oleksij Rempel
  0 siblings, 0 replies; 17+ messages in thread
From: Oleksij Rempel @ 2018-09-13 14:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 13, 2018 at 02:17:12PM +0100, Will Deacon wrote:
> On Mon, Jul 09, 2018 at 08:20:22AM +0200, Oleksij Rempel wrote:
> > On 06.07.2018 14:26, Will Deacon wrote:
> > > On Mon, Jul 02, 2018 at 06:45:52PM +0100, Will Deacon wrote:
> > >> On Mon, Jul 02, 2018 at 03:49:26PM +0200, Lucas Stach wrote:
> > >>> Also the section "External caches", "Before AMBA4 ACE" seems to suggest
> > >>> that the barriers aren't fully propagated to the PL310 write-caches and
> > >>> master ports. I'm unsure if this is just an artifact of the mentioned
> > >>> MMIO access, so handling of normal vs. device memory transactions in
> > >>> the PL310.
> > >>
> > >> I'll check with the hardware folks about the PL310, but I don't see how
> > >> you could propagate barriers over AXI3 anyway because I don't think this
> > >> stuff existed back then.
> > > 
> > > Damn. As I feared, you can't propagate the barrier to the PL310, so the
> > > A9 is unable to enforce ordering beyond the responses it gets back. The
> > > PL310 will then re-order normal memory accesses (including non-cacheable)
> > > in its store buffer, so we do actually need a bloody CACHE SYNC here.
> > > 
> > > Patch below. Does it help?
> > 
> > Suddenly no.
> 
> I just noticed that I'm still carrying this patch in my tree. Are you saying
> it doesn't help on your system? It all went quiet, so I assume you fixed
> this issue one way or the other?

Hi, I was no able to fix this problem. Currently we are using 2G memory
split. With this configuration we are not able to reproduce it, at least
not with described test.

> Will
> 
> > > From 5a2769b45e94b212d21bc44f5940ed5f3bde5cc2 Mon Sep 17 00:00:00 2001
> > > From: Will Deacon <will.deacon@arm.com>
> > > Date: Wed, 4 Jul 2018 10:56:22 +0100
> > > Subject: [PATCH] ARM: barrier: Kick outer-cache and SoC fabric in dma_wmb()
> > > 
> > > Prior to AMBA4-ACE, the AXI bus protocol had no support for barrier
> > > transactions or shareability domains and was therefore unable to enforce
> > > ordering between memory accesses either side of a barrier instruction,
> > > requiring the CPU to handle all of the ordering itself.
> > > 
> > > Unfortunately, IP such as the venerable PL310 will re-order stores to
> > > Normal memory inside its store-buffer, meaning that accesses to a coherent
> > > DMA buffer (i.e. one allocated by dma_alloc_coherent()) can be observed
> > > out-of-order by a non-coherent DMA engine unless the PL310 is kicked with
> > > a CACHE_SYNC command between buffering the accesses. Whilst this is already
> > > the case for mb() and wmb(), dma_wmb() just expands to a DMB OSHST, leading
> > > to sequences such as:
> > > 
> > > 	STR	Rdata, [Rbuf, #DATA_OFFSET]
> > > 	DMB	OSHST
> > > 	STR	Rvalid, [Rbuf, #VALID_OFFSET]
> > > 
> > > being reordered such that the DMA device sees the write of valid but not
> > > the write of data.
> > > 
> > > This patch ensures that the SoC fabric and the outer-cache are kicked
> > > by dma_wmb() if they are able to buffer transactions outside of the CPU
> > > but before the Point of Coherency.
> > > 
> > > Reported-by: Lucas Stach <l.stach@pengutronix.de>
> > > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > > ---
> > >  arch/arm/include/asm/barrier.h    | 14 ++++++++++----
> > >  arch/arm/include/asm/outercache.h |  6 ++++++
> > >  arch/arm/mm/flush.c               | 24 +++++++++++++++++++++++-
> > >  3 files changed, 39 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/arm/include/asm/barrier.h b/arch/arm/include/asm/barrier.h
> > > index 69772e742a0a..e31833f72b69 100644
> > > --- a/arch/arm/include/asm/barrier.h
> > > +++ b/arch/arm/include/asm/barrier.h
> > > @@ -53,17 +53,23 @@
> > >  #ifdef CONFIG_ARM_HEAVY_MB
> > >  extern void (*soc_mb)(void);
> > >  extern void arm_heavy_mb(void);
> > > -#define __arm_heavy_mb(x...) do { dsb(x); arm_heavy_mb(); } while (0)
> > > +extern void arm_heavy_wmb(void);
> > > +extern void arm_heavy_dma_wmb(void);
> > > +#define __arm_heavy_mb()	arm_heavy_mb()
> > > +#define __arm_heavy_wmb()	arm_heavy_wmb()
> > > +#define __arm_heavy_dma_wmb()	arm_heavy_dma_wmb()
> > >  #else
> > > -#define __arm_heavy_mb(x...) dsb(x)
> > > +#define __arm_heavy_mb()	dsb()
> > > +#define __arm_heavy_wmb()	dsb(st)
> > > +#define __arm_heavy_dma_wmb()	dmb(oshst)
> > >  #endif
> > >  
> > >  #if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
> > >  #define mb()		__arm_heavy_mb()
> > >  #define rmb()		dsb()
> > > -#define wmb()		__arm_heavy_mb(st)
> > > +#define wmb()		__arm_heavy_wmb()
> > >  #define dma_rmb()	dmb(osh)
> > > -#define dma_wmb()	dmb(oshst)
> > > +#define dma_wmb()	__arm_heavy_dma_wmb()
> > >  #else
> > >  #define mb()		barrier()
> > >  #define rmb()		barrier()
> > > diff --git a/arch/arm/include/asm/outercache.h b/arch/arm/include/asm/outercache.h
> > > index c2bf24f40177..e744efe6e2a7 100644
> > > --- a/arch/arm/include/asm/outercache.h
> > > +++ b/arch/arm/include/asm/outercache.h
> > > @@ -43,6 +43,12 @@ struct outer_cache_fns {
> > >  
> > >  extern struct outer_cache_fns outer_cache;
> > >  
> > > +#ifdef CONFIG_OUTER_CACHE_SYNC
> > > +static inline bool outer_cache_has_sync(void) { return outer_cache.sync; }
> > > +#else
> > > +static inline bool outer_cache_has_sync(void) { return 0; }
> > > +#endif
> > > +
> > >  #ifdef CONFIG_OUTER_CACHE
> > >  /**
> > >   * outer_inv_range - invalidate range of outer cache lines
> > > diff --git a/arch/arm/mm/flush.c b/arch/arm/mm/flush.c
> > > index 58469623b015..1252689362d8 100644
> > > --- a/arch/arm/mm/flush.c
> > > +++ b/arch/arm/mm/flush.c
> > > @@ -24,7 +24,7 @@
> > >  #ifdef CONFIG_ARM_HEAVY_MB
> > >  void (*soc_mb)(void);
> > >  
> > > -void arm_heavy_mb(void)
> > > +static void __arm_heavy_sync(void)
> > >  {
> > >  #ifdef CONFIG_OUTER_CACHE_SYNC
> > >  	if (outer_cache.sync)
> > > @@ -33,7 +33,29 @@ void arm_heavy_mb(void)
> > >  	if (soc_mb)
> > >  		soc_mb();
> > >  }
> > > +
> > > +void arm_heavy_mb(void)
> > > +{
> > > +	dsb();
> > > +	__arm_heavy_sync();
> > > +}
> > >  EXPORT_SYMBOL(arm_heavy_mb);
> > > +
> > > +void arm_heavy_wmb(void)
> > > +{
> > > +	dsb(st);
> > > +	__arm_heavy_sync();
> > > +}
> > > +EXPORT_SYMBOL(arm_heavy_wmb);
> > > +
> > > +void arm_heavy_dma_wmb(void)
> > > +{
> > > +	if (outer_cache_has_sync() || soc_mb)
> > > +		arm_heavy_wmb();
> > > +	else
> > > +		dmb(oshst);
> > > +}
> > > +EXPORT_SYMBOL(arm_heavy_dma_wmb);
> > >  #endif
> > >  
> > >  #ifdef CONFIG_CPU_CACHE_VIPT
> > > 
> > 
> 
> 
> 
> 

-- 
Pengutronix e.K.                           |                             |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
Peiner Str. 6-8, 31137 Hildesheim, Germany | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 488 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20180913/ee0972d5/attachment.sig>

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2018-09-13 14:09 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-29 12:28 Clarifying dma_wmb behavior in presence of non-coherent masters and outer caches Lucas Stach
2018-06-29 14:25 ` Russell King - ARM Linux
2018-06-29 16:19   ` Lucas Stach
2018-06-29 16:22   ` Will Deacon
2018-06-29 16:48     ` Russell King - ARM Linux
2018-06-29 17:43       ` Will Deacon
2018-06-29 18:01         ` Russell King - ARM Linux
2018-07-02 13:49         ` Lucas Stach
2018-07-02 17:45           ` Will Deacon
2018-07-06 12:26             ` Will Deacon
2018-07-09  6:20               ` Oleksij Rempel
2018-09-13 13:17                 ` Will Deacon
2018-09-13 14:09                   ` Oleksij Rempel
2018-07-09  9:45               ` Lucas Stach
2018-06-29 17:14     ` Lucas Stach
2018-06-29 17:46       ` Will Deacon
2018-07-02  9:58         ` Lucas Stach

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.