* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
       [not found] ` <CAHrpEqRsp2_bt=p5JgS5F-2F_LCwgT+VX7mSENzpEYTQiW1tjg@mail.gmail.com>
@ 2021-06-17  9:27   ` Catalin Marinas
  2021-06-17 17:25     ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2021-06-17  9:27 UTC (permalink / raw)
  To: Zhi Li
  Cc: Frank Li, Will Deacon, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > Will Deacon wrote:
> > > It would also be helpful to know a bit more about the hardware:
> > >
> > >   - What is the "internal bus fabric"?
> 
> > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > multiple AXI slaves connected together.
>
> I drew a simplified bus structure.
>  
>         ┌──────┐ ┌────┐
>         │ A53  │ │A72 │
>         └───┬──┘ └─┬──┘
>             │      │
>         ┌───▼──────▼──┐
>         │    CCI400   │
>         └─────┬───────┘
>               │   1 (a)write to ddr (normal uncached memory)
>               │   DMB OSHST
>               │   2 (b)write to usb register(device, nGnRE)
>         ┌─────▼───────────────────────┐       ┌───────────┐
>         │                             ◄───────┤   GPU     │
>         │     Bus fabric              │       │           │
>         └────────────────────────────┬┘       └───────────┘
> 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
>          │        │   ddr        │   │
>       ┌──▼────────┴─┐            │   │
>       │             │            │   │
>       │  USB        │      5.usb │   │
>       │             │      read  │   │
>       └─────────────┘            │   │
>                                ┌─┴───▼─┐
>                                │       │
>                                │ DDR   │
>                                │       │
>                                └───────┘

Since you sent an HTML message, it was rejected by the list server. The
above is a plain-text rendition by w3m (and changed barrier() to DMB
OSHST).

Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
write (b) to USB is observable by, let's say, the GPU, the same GPU
should also observe the write (a) to DDR. Since the write (a) to DDR is
globally observable, the USB device read at (4) should also observe it
(well, we may be wrong).

So while the bus fabric could ensure the ordering of the DDR write (a)
and the USB write (b) from the perspective of a third observer (the
GPU), I don't see how it can force it from the USB perspective as it
cannot observe the write (b) to its registers.

Replacing the DMB with the DSB forces the write (a) to reach the DDR on
your platform.
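
For reference, the numbered steps in the diagram can be sketched in C roughly
as follows. This is only a user-space illustration under stated assumptions:
`struct desc`, `sketch_dma_wmb()`, `sketch_writel()` and `queue_transfer()` are
hypothetical names, the descriptor layout is invented, and on non-arm64 hosts a
C11 release fence stands in for DMB OSHST.

```c
#include <stdint.h>
#include <stdatomic.h>

/* Hypothetical descriptor in normal non-cacheable memory (DDR). */
struct desc {
    uint32_t addr;
    uint32_t len;
};

/* Stand-in for the barrier in the diagram: DMB OSHST on arm64,
 * a C11 release fence on other hosts (assumption for portability). */
static inline void sketch_dma_wmb(void)
{
#if defined(__aarch64__)
    __asm__ volatile("dmb oshst" ::: "memory");
#else
    atomic_thread_fence(memory_order_release);
#endif
}

/* Stand-in for a register write: barrier first, then the store. */
static inline void sketch_writel(uint32_t val, volatile uint32_t *reg)
{
    sketch_dma_wmb();
    *reg = val;
}

/* Step 1: write (a) to the descriptor in DDR.
 * Step 2: barrier, then write (b) to the device doorbell register. */
static void queue_transfer(volatile struct desc *d,
                           volatile uint32_t *doorbell,
                           uint32_t addr, uint32_t len)
{
    d->addr = addr;              /* (a) write to DDR */
    d->len = len;
    sketch_writel(1, doorbell);  /* DMB OSHST, then (b) */
}
```

The question in this thread is whether the device, once it has seen (b), is
guaranteed to read the values stored by (a); the sketch only shows the program
order on the CPU side.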

Will, any better idea of why it goes wrong?

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17  9:27   ` The problem about arm64: io: Relax implicit barriers in default I/O accessors Catalin Marinas
@ 2021-06-17 17:25     ` Will Deacon
  2021-06-17 17:41       ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-17 17:25 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Zhi Li, Frank Li, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > Will Deacon wrote:
> > > > It would also be helpful to know a bit more about the hardware:
> > > >
> > > >   - What is the "internal bus fabric"?
> > 
> > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > multiple AXI slaves connected together.
> >
> > I drew a simplified bus structure.
> >  
> >         ┌──────┐ ┌────┐
> >         │ A53  │ │A72 │
> >         └───┬──┘ └─┬──┘
> >             │      │
> >         ┌───▼──────▼──┐
> >         │    CCI400   │
> >         └─────┬───────┘
> >               │   1 (a)write to ddr (normal uncached memory)
> >               │   DMB OSHST
> >               │   2 (b)write to usb register(device, nGnRE)
> >         ┌─────▼───────────────────────┐       ┌───────────┐
> >         │                             ◄───────┤   GPU     │
> >         │     Bus fabric              │       │           │
> >         └────────────────────────────┬┘       └───────────┘
> > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> >          │        │   ddr        │   │
> >       ┌──▼────────┴─┐            │   │
> >       │             │            │   │
> >       │  USB        │      5.usb │   │
> >       │             │      read  │   │
> >       └─────────────┘            │   │
> >                                ┌─┴───▼─┐
> >                                │       │
> >                                │ DDR   │
> >                                │       │
> >                                └───────┘
> 
> Since you sent an HTML message, it was rejected by the list server. The
> above is a plain-text rendition by w3m (and changed barrier() to DMB
> OSHST).
> 
> Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> write (b) to USB is observable by, let's say, the GPU, the same GPU
> should also observe the write (a) to DDR. Since the write (a) to DDR is
> globally observable, the USB device read at (4) should also observe it
> (well, we may be wrong).

It's pretty rare for barriers to propagate onto the fabric -- usually the
CPU just orders everything based on acknowledgements. If the CCI gives the
write response for the non-cacheable write I could see that causing an issue
if the bus fabric can then reorder accesses, but then I would argue that's a
broken system because simple ring buffers in non-cacheable memory would fail
for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
broken). I think it would also mean that DSB doesn't necessarily fix the
issue, it probably just makes it less likely because it takes longer to
get the device write out after the acknowledgement -- ndelay() would achieve
the same effect :)
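
The "simple ring buffer" pattern mentioned above can be sketched as follows. It
is a minimal single-producer, single-consumer model in portable C11, not kernel
code: `struct ring`, `ring_push()` and `ring_pop()` are hypothetical names, and
the release/acquire fences merely play the roles that dma_wmb() and dma_rmb()
would play for a buffer in non-cacheable memory.

```c
#include <stdint.h>
#include <stdatomic.h>

#define RING_SIZE 8u  /* hypothetical size; must be a power of two */

struct ring {
    uint32_t slots[RING_SIZE];
    _Atomic uint32_t head;  /* advanced by the producer */
    _Atomic uint32_t tail;  /* advanced by the consumer */
};

/* Producer: fill the slot, fence, then publish the new head index.
 * Returns 0 on success, -1 if the ring is full. */
static int ring_push(struct ring *r, uint32_t v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return -1;
    r->slots[head % RING_SIZE] = v;
    atomic_thread_fence(memory_order_release);  /* role of dma_wmb() */
    atomic_store_explicit(&r->head, head + 1, memory_order_relaxed);
    return 0;
}

/* Consumer: observe the head index, fence, then read the slot.
 * Returns 0 on success, -1 if the ring is empty. */
static int ring_pop(struct ring *r, uint32_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (head == tail)
        return -1;
    atomic_thread_fence(memory_order_acquire);  /* role of dma_rmb() */
    *out = r->slots[tail % RING_SIZE];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

If the payload store could become visible to the consumer after the index
store despite the fence, the consumer would read a stale slot, which is the
failure mode described above.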

Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
in writel()?

Will


* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 17:25     ` Will Deacon
@ 2021-06-17 17:41       ` Will Deacon
  2021-06-17 20:11         ` [EXT] " Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-17 17:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Zhi Li, Frank Li, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
	linux-arm-kernel

On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > Will Deacon wrote:
> > > > > It would also be helpful to know a bit more about the hardware:
> > > > >
> > > > >   - What is the "internal bus fabric"?
> > > 
> > > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > > multiple AXI slaves connected together.
> > >
> > > I drew a simplified bus structure.
> > >  
> > >         ┌──────┐ ┌────┐
> > >         │ A53  │ │A72 │
> > >         └───┬──┘ └─┬──┘
> > >             │      │
> > >         ┌───▼──────▼──┐
> > >         │    CCI400   │
> > >         └─────┬───────┘
> > >               │   1 (a)write to ddr (normal uncached memory)
> > >               │   DMB OSHST
> > >               │   2 (b)write to usb register(device, nGnRE)
> > >         ┌─────▼───────────────────────┐       ┌───────────┐
> > >         │                             ◄───────┤   GPU     │
> > >         │     Bus fabric              │       │           │
> > >         └────────────────────────────┬┘       └───────────┘
> > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > >          │        │   ddr        │   │
> > >       ┌──▼────────┴─┐            │   │
> > >       │             │            │   │
> > >       │  USB        │      5.usb │   │
> > >       │             │      read  │   │
> > >       └─────────────┘            │   │
> > >                                ┌─┴───▼─┐
> > >                                │       │
> > >                                │ DDR   │
> > >                                │       │
> > >                                └───────┘
> > 
> > Since you sent an HTML message, it was rejected by the list server. The
> > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > OSHST).
> > 
> > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > globally observable, the USB device read at (4) should also observe it
> > (well, we may be wrong).
> 
> It's pretty rare for barriers to propagate onto the fabric -- usually the
> CPU just orders everything based on acknowledgements. If the CCI gives the
> write response for the non-cacheable write I could see that causing an issue
> if the bus fabric can then reorder accesses, but then I would argue that's a
> broken system because simple ring buffers in non-cacheable memory would fail
> for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> broken). I think it would also mean that DSB doesn't necessarily fix the
> issue, it probably just makes it less likely because it takes longer to
> get the device write out after the acknowledgement -- ndelay() would achieve
> the same effect :)
> 
> Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
> in writel()?

Also, digging into the A72 TRM there are a bunch of configuration signals
in this area; see SYSBARDISABLE and BROADCASTOUTER, for example.

Does the failure happen on both a53 and a72, or only on one CPU type?

Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 17:41       ` Will Deacon
@ 2021-06-17 20:11         ` Frank Li
  2021-06-17 21:40           ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-17 20:11 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas
  Cc: Zhi Li, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Thursday, June 17, 2021 12:42 PM
> To: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Zhi Li <lznuaa@gmail.com>; Frank Li <frank.li@nxp.com>; Shenwei Wang
> <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers in
> default I/O accessors
> 
> Caution: EXT Email
> 
> On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > > Will Deacon wrote:
> > > > > > It would also be helpful to know a bit more about the hardware:
> > > > > >
> > > > > >   - What is the "internal bus fabric"?
> > > >
> > > > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > > > multiple AXI slaves connected together.
> > > >
> > > > I drew a simplified bus structure.
> > > >
> > > >         ┌──────┐ ┌────┐
> > > >         │ A53  │ │A72 │
> > > >         └───┬──┘ └─┬──┘
> > > >             │      │
> > > >         ┌───▼──────▼──┐
> > > >         │    CCI400   │
> > > >         └─────┬───────┘
> > > >               │   1 (a)write to ddr (normal uncached memory)
> > > >               │   DMB OSHST
> > > >               │   2 (b)write to usb register(device, nGnRE)
> > > >         ┌─────▼───────────────────────┐       ┌───────────┐
> > > >         │                             ◄───────┤   GPU     │
> > > >         │     Bus fabric              │       │           │
> > > >         └────────────────────────────┬┘       └───────────┘
> > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > >          │        │   ddr        │   │
> > > >       ┌──▼────────┴─┐            │   │
> > > >       │             │            │   │
> > > >       │  USB        │      5.usb │   │
> > > >       │             │      read  │   │
> > > >       └─────────────┘            │   │
> > > >                                ┌─┴───▼─┐
> > > >                                │       │
> > > >                                │ DDR   │
> > > >                                │       │
> > > >                                └───────┘
> > >
> > > Since you sent an HTML message, it was rejected by the list server. The
> > > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > > OSHST).
> > >
> > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > > globally observable, the USB device read at (4) should also observe it
> > > (well, we may be wrong).
> >
> > It's pretty rare for barriers to propagate onto the fabric -- usually the
> > CPU just orders everything based on acknowledgements. If the CCI gives the
> > write response for the non-cacheable write I could see that causing an issue
> > if the bus fabric can then reorder accesses, but then I would argue that's a
> > broken system because simple ring buffers in non-cacheable memory would fail

The bus fabric doesn't reorder accesses from the same AXI master.
https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
Page 42 shows a race condition. I think that race condition is what happens on
our system. I am not sure whether it exists on Armv8 systems.

> > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > broken). I think it would also mean that DSB doesn't necessarily fix the
> > issue, it probably just makes it less likely because it takes longer to
> > get the device write out after the acknowledgement -- ndelay() would achieve
> > the same effect :)

That's what I was worried about.

> >
> > Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
> > in writel()?

It has worked well for 2 hours! Normally the problem happens within 10 minutes,
so I think DMB SY can fix it.

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index c3009b0e52393..277c9d1c1a8fa 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -47,7 +47,7 @@

 #define dma_mb()       dmb(osh)
 #define dma_rmb()      dmb(oshld)
-#define dma_wmb()      dmb(oshst)
+#define dma_wmb()      dmb(sy)

> 
> Also, digging into the A72 TRM there are a bunch of configuration signals
> in this area; see SYSBARDISABLE and BROADCASTOUTER, for example.
> 
> Does the failure happen on both a53 and a72, or only on one CPU type?

Both the A53 and the A72 have this problem.

> 
> Will

* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 20:11         ` [EXT] " Frank Li
@ 2021-06-17 21:40           ` Will Deacon
  2021-06-17 22:13             ` Frank Li
                               ` (2 more replies)
  0 siblings, 3 replies; 27+ messages in thread
From: Will Deacon @ 2021-06-17 21:40 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Thu, Jun 17, 2021 at 08:11:50PM +0000, Frank Li wrote:
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Thursday, June 17, 2021 12:42 PM
> > To: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Zhi Li <lznuaa@gmail.com>; Frank Li <frank.li@nxp.com>; Shenwei Wang
> > <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers in
> > default I/O accessors
> > 
> > Caution: EXT Email
> > 
> > On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > > > Will Deacon wrote:
> > > > > > > It would also be helpful to know a bit more about the hardware:
> > > > > > >
> > > > > > >   - What is the "internal bus fabric"?
> > > > >
> > > > > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > > > > multiple AXI slaves connected together.
> > > > >
> > > > > I drew a simplified bus structure.
> > > > >
> > > > >         ┌──────┐ ┌────┐
> > > > >         │ A53  │ │A72 │
> > > > >         └───┬──┘ └─┬──┘
> > > > >             │      │
> > > > >         ┌───▼──────▼──┐
> > > > >         │    CCI400   │
> > > > >         └─────┬───────┘
> > > > >               │   1 (a)write to ddr (normal uncached memory)
> > > > >               │   DMB OSHST
> > > > >               │   2 (b)write to usb register(device, nGnRE)
> > > > >         ┌─────▼───────────────────────┐       ┌───────────┐
> > > > >         │                             ◄───────┤   GPU     │
> > > > >         │     Bus fabric              │       │           │
> > > > >         └────────────────────────────┬┘       └───────────┘
> > > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > > >          │        │   ddr        │   │
> > > > >       ┌──▼────────┴─┐            │   │
> > > > >       │             │            │   │
> > > > >       │  USB        │      5.usb │   │
> > > > >       │             │      read  │   │
> > > > >       └─────────────┘            │   │
> > > > >                                ┌─┴───▼─┐
> > > > >                                │       │
> > > > >                                │ DDR   │
> > > > >                                │       │
> > > > >                                └───────┘
> > > >
> > > > Since you sent an HTML message, it was rejected by the list server. The
> > > > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > > > OSHST).
> > > >
> > > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > > > globally observable, the USB device read at (4) should also observe it
> > > > (well, we may be wrong).
> > >
> > > It's pretty rare for barriers to propagate onto the fabric -- usually the
> > > CPU just orders everything based on acknowledgements. If the CCI gives the
> > > write response for the non-cacheable write I could see that causing an issue
> > > if the bus fabric can then reorder accesses, but then I would argue that's a
> > > broken system because simple ring buffers in non-cacheable memory would fail
>
> The bus fabric doesn't reorder accesses from the same AXI master.
> https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
> Page 42 shows a race condition. I think that race condition is what happens on
> our system. I am not sure whether it exists on Armv8 systems.

Just a word of warning here, but the Armv8 memory model was
*retrospectively* strengthened since I gave that talk, so the stuff in that
pdf is out of date (and wrong).

> > > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > > broken). I think it would also mean that DSB doesn't necessarily fix the
> > > issue, it probably just makes it less likely because it takes longer to
> > > get the device write out after the acknowledgement -- ndelay() would achieve
> > > the same effect :)
>
> That's what I was worried about.
>
> > >
> > > Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
> > > in writel()?
>
> It has worked well for 2 hours! Normally the problem happens within 10 minutes,
> so I think DMB SY can fix it.

Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
difference. I'm not quite sure what it means for the coherency of normal,
non-cacheable accesses (which are outer-shareable) so that probably needs a
bit more thought.

Can you confirm that the issue *does* still occur if you use dmb(osh)
instead of dmb(oshst), please?
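
For readers comparing the variants being tested (OSHST, OSH, SY), the sketch
below decodes the 4-bit option field of the A64 DMB instruction into its
shareability domain and access type. It only illustrates the architectural
encoding; the enum and helper names are hypothetical, and nothing here says how
a particular interconnect treats each domain.

```c
#include <stdint.h>

/* CRm values for the DMB options discussed in this thread. */
enum dmb_opt {
    DMB_OSHLD = 0x1,
    DMB_OSHST = 0x2,
    DMB_OSH   = 0x3,
    DMB_ISHLD = 0x9,
    DMB_ISHST = 0xA,
    DMB_ISH   = 0xB,
    DMB_LD    = 0xD,
    DMB_ST    = 0xE,
    DMB_SY    = 0xF
};

/* A64 encoding of "DMB <option>": the option sits in bits [11:8]. */
static inline uint32_t dmb_encoding(enum dmb_opt opt)
{
    return 0xD50330BFu | ((uint32_t)opt << 8);
}

/* Bits [3:2] of the option select the shareability domain. */
static inline const char *dmb_domain(enum dmb_opt opt)
{
    switch (((unsigned)opt >> 2) & 0x3u) {
    case 0:  return "outer shareable";
    case 1:  return "non-shareable";
    case 2:  return "inner shareable";
    default: return "full system";
    }
}

/* Bits [1:0] select which accesses are ordered. */
static inline const char *dmb_type(enum dmb_opt opt)
{
    switch ((unsigned)opt & 0x3u) {
    case 1:  return "loads";
    case 2:  return "stores";
    default: return "loads and stores";
    }
}
```

With these helpers, dmb(oshst) and dmb(osh) differ only in the access type,
while dmb(sy) differs from dmb(osh) only in the domain.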

Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 21:40           ` Will Deacon
@ 2021-06-17 22:13             ` Frank Li
  2021-06-18 14:56             ` Nitin Garg
  2021-06-21 16:11             ` Frank Li
  2 siblings, 0 replies; 27+ messages in thread
From: Frank Li @ 2021-06-17 22:13 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Thursday, June 17, 2021 4:40 PM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Thu, Jun 17, 2021 at 08:11:50PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: Thursday, June 17, 2021 12:42 PM
> > > To: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Zhi Li <lznuaa@gmail.com>; Frank Li <frank.li@nxp.com>; Shenwei Wang
> > > <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>;
> > > linux-arm-kernel@lists.infradead.org
> > > Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers in
> > > default I/O accessors
> > >
> > > Caution: EXT Email
> > >
> > > On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > > > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > > > > Will Deacon wrote:
> > > > > > > > It would also be helpful to know a bit more about the hardware:
> > > > > > > >
> > > > > > > >   - What is the "internal bus fabric"?
> > > > > >
> > > > > > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > > > > > multiple AXI slaves connected together.
> > > > > >
> > > > > > I drew a simplified bus structure.
> > > > > >
> > > > > >         ┌──────┐ ┌────┐
> > > > > >         │ A53  │ │A72 │
> > > > > >         └───┬──┘ └─┬──┘
> > > > > >             │      │
> > > > > >         ┌───▼──────▼──┐
> > > > > >         │    CCI400   │
> > > > > >         └─────┬───────┘
> > > > > >               │   1 (a)write to ddr (normal uncached memory)
> > > > > >               │   DMB OSHST
> > > > > >               │   2 (b)write to usb register(device, nGnRE)
> > > > > >         ┌─────▼───────────────────────┐       ┌───────────┐
> > > > > >         │                             ◄───────┤   GPU     │
> > > > > >         │     Bus fabric              │       │           │
> > > > > >         └────────────────────────────┬┘       └───────────┘
> > > > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > > > >          │        │   ddr        │   │
> > > > > >       ┌──▼────────┴─┐            │   │
> > > > > >       │             │            │   │
> > > > > >       │  USB        │      5.usb │   │
> > > > > >       │             │      read  │   │
> > > > > >       └─────────────┘            │   │
> > > > > >                                ┌─┴───▼─┐
> > > > > >                                │       │
> > > > > >                                │ DDR   │
> > > > > >                                │       │
> > > > > >                                └───────┘
> > > > >
> > > > > Since you sent an HTML message, it was rejected by the list server. The
> > > > > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > > > > OSHST).
> > > > >
> > > > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > > > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > > > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > > > > globally observable, the USB device read at (4) should also observe it
> > > > > (well, we may be wrong).
> > > >
> > > > It's pretty rare for barriers to propagate onto the fabric -- usually the
> > > > CPU just orders everything based on acknowledgements. If the CCI gives the
> > > > write response for the non-cacheable write I could see that causing an issue
> > > > if the bus fabric can then reorder accesses, but then I would argue that's a
> > > > broken system because simple ring buffers in non-cacheable memory would fail
> >
> > The bus fabric doesn't reorder accesses from the same AXI master.
> > https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
> > Page 42 shows a race condition. I think that race condition is what happens on
> > our system. I am not sure whether it exists on Armv8 systems.
> 
> Just a word of warning here, but the Armv8 memory model was
> *retrospectively* strengthened since I gave that talk, so the stuff in that
> pdf is out of date (and wrong).
> 
> > > > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > > > broken). I think it would also mean that DSB doesn't necessarily fix the
> > > > issue, it probably just makes it less likely because it takes longer to
> > > > get the device write out after the acknowledgement -- ndelay() would achieve
> > > > the same effect :)
> >
> > That's what I was worried about.
> >
> > > >
> > > > Frank -- what happens if you try either DMB SY, or DMB OSH (without the ST)
> > > > in writel()?
> >
> > It has worked well for 2 hours! Normally the problem happens within 10 minutes,
> > so I think DMB SY can fix it.
> 
> Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> difference. I'm not quite sure what it means for the coherency of normal,
> non-cacheable accesses (which are outer-shareable) so that probably needs a
> bit more thought.
> 
> Can you confirm that the issue *does* still occur if you use dmb(osh)
> instead of dmb(oshst), please?

Yes, dmb(osh) still has the problem.

diff --git a/arch/arm64/include/asm/barrier.h b/arch/arm64/include/asm/barrier.h
index 277c9d1c1a8fa..d53fbe9f9f7ce 100644
--- a/arch/arm64/include/asm/barrier.h
+++ b/arch/arm64/include/asm/barrier.h
@@ -47,7 +47,7 @@

 #define dma_mb()       dmb(osh)
 #define dma_rmb()      dmb(oshld)
-#define dma_wmb()      dmb(sy)
+#define dma_wmb()      dmb(osh)

> 
> Will

* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 21:40           ` Will Deacon
  2021-06-17 22:13             ` Frank Li
@ 2021-06-18 14:56             ` Nitin Garg
  2021-06-21 16:11             ` Frank Li
  2 siblings, 0 replies; 27+ messages in thread
From: Nitin Garg @ 2021-06-18 14:56 UTC (permalink / raw)
  To: Will Deacon, Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Jason Liu,
	linux-arm-kernel


On Thu, Jun 17, 2021 at 08:11:50PM +0000, Frank Li wrote:
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Thursday, June 17, 2021 12:42 PM
> > To: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Zhi Li <lznuaa@gmail.com>; Frank Li <frank.li@nxp.com>; Shenwei Wang
> > <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers in
> > default I/O accessors
> > 
> > Caution: EXT Email
> > 
> > On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > > > Will Deacon wrote:
> > > > > > > It would also be helpful to know a bit more about the hardware:
> > > > > > >
> > > > > > >   - What is the "internal bus fabric"?
> > > > >
> > > > > > It looks like what ARM calls an "Interconnect": multiple AXI masters and
> > > > > > multiple AXI slaves connected together.
> > > > >
> > > > > I drew a simplified bus structure.
> > > > >
> > > > >         ┌──────┐ ┌────┐
> > > > >         │ A53  │ │A72 │
> > > > >         └───┬──┘ └─┬──┘
> > > > >             │      │
> > > > >         ┌───▼──────▼──┐
> > > > >         │    CCI400   │
> > > > >         └─────┬───────┘
> > > > >               │   1 (a)write to ddr (normal uncached memory)
> > > > >               │   DMB OSHST
> > > > >               │   2 (b)write to usb register(device, nGnRE)
> > > > >         ┌─────▼───────────────────────┐       ┌───────────┐
> > > > >         │                             ◄───────┤   GPU     │
> > > > >         │     Bus fabric              │       │           │
> > > > >         └────────────────────────────┬┘       └───────────┘
> > > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > > >          │        │   ddr        │   │
> > > > >       ┌──▼────────┴─┐            │   │
> > > > >       │             │            │   │
> > > > >       │  USB        │      5.usb │   │
> > > > >       │             │      read  │   │
> > > > >       └─────────────┘            │   │
> > > > >                                ┌─┴───▼─┐
> > > > >                                │       │
> > > > >                                │ DDR   │
> > > > >                                │       │
> > > > >                                └───────┘
> > > >
> > > > Since you sent an HTML message, it was rejected by the list server. The
> > > > above is a plain-text rendition by w3m (and changed barrier() to DMB
> > > > OSHST).
> > > >
> > > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if the
> > > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > > should also observe the write (a) to DDR. Since the write (a) to DDR is
> > > > globally observable, the USB device read at (4) should also observe it
> > > > (well, we may be wrong).
> > >
> > > It's pretty rare for barriers to propagate onto the fabric -- usually the
> > > CPU just orders everything based on acknowledgements. If the CCI gives
> > the
> > > write response for the non-cacheable write I could see that causing an
> > issue
> > > if the bus fabric can then reorder accesses, but then I would argue
> > that's a
> > > broken system because simple ring buffers in non-cacheable memory would
> > fail
> 
> Bus fabric don't reorder the same axi master. 
> https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
> Page 42 show race condition. I think above race condition happen at our system.
> I am not sure if it is exist at Armv8 system.

Just a word of warning here, but the Armv8 memory model was
*retrospectively* strengthened since I gave that talk, so the stuff in that
pdf is out of date (and wrong).

> > > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > > broken). I think it would also mean that DSB doesn't necessarily fix the
> > > issue, it probably just makes it less likely because it takes longer to
> > > get the device write out after the acknowledgement -- ndelay() would
> > achieve
> > > the same effect :)
> 
> That's what I worried. 
> 
> > >
> > > Frank -- what happens if you try either DMB SY, or DMB OSH (without the
> > ST)
> > > in writel()?
> 
> It works well for 2 hours! Normally, problem happen below 10min. So I think DMB SY
> can fix it. 

> Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> difference. I'm not quite sure what it means for the coherency of normal,
> non-cacheable accesses (which are outer-shareable) so that probably needs a
> bit more thought.

> Can you confirm that the issue *does* still occur if you use dmb(osh)
> instead of dmb(oshst), please?

dmb(osh) fails; dmb(st) works fine, like dmb(sy).
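
For anyone reproducing this, the barrier variants tried above can be swapped
into a writel()-style accessor roughly as follows. This is only a userspace
sketch of the barrier-then-store shape discussed in this thread: the names
sketch_writel(), sketch_dmb() and SKETCH_DMB_OPT are made up for illustration
and are not the kernel's, and on anything other than AArch64 the barrier
falls back to a compiler-provided full fence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical knob: the DMB option under test (oshst, osh, st or sy). */
#ifndef SKETCH_DMB_OPT
#define SKETCH_DMB_OPT oshst
#endif

#define SKETCH_STR2(x) #x
#define SKETCH_STR(x) SKETCH_STR2(x)

#if defined(__aarch64__)
#define sketch_dmb() \
	__asm__ volatile("dmb " SKETCH_STR(SKETCH_DMB_OPT) : : : "memory")
#else
/* Assumption: not building for AArch64, so use a generic full fence. */
#define sketch_dmb() __atomic_thread_fence(__ATOMIC_SEQ_CST)
#endif

/* Order earlier stores (e.g. descriptors in DDR) before the register write. */
static inline void sketch_writel(uint32_t val, volatile uint32_t *addr)
{
	/* 32-bit register writes are expected to be naturally aligned. */
	assert(((uintptr_t)addr & 3) == 0);
	sketch_dmb();
	*addr = val;
}
```

Building with -DSKETCH_DMB_OPT=osh, =st or =sy selects the other variants
reported above.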

Nitin Garg
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-17 21:40           ` Will Deacon
  2021-06-17 22:13             ` Frank Li
  2021-06-18 14:56             ` Nitin Garg
@ 2021-06-21 16:11             ` Frank Li
  2021-06-21 16:26               ` Will Deacon
  2 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-21 16:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Thursday, June 17, 2021 4:40 PM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Thu, Jun 17, 2021 at 08:11:50PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: Thursday, June 17, 2021 12:42 PM
> > > To: Catalin Marinas <catalin.marinas@arm.com>
> > > Cc: Zhi Li <lznuaa@gmail.com>; Frank Li <frank.li@nxp.com>; Shenwei
> Wang
> > > <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > kernel@lists.infradead.org
> > > Subject: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in
> > > default I/O accessors
> > >
> > > Caution: EXT Email
> > >
> > > On Thu, Jun 17, 2021 at 06:25:28PM +0100, Will Deacon wrote:
> > > > On Thu, Jun 17, 2021 at 10:27:44AM +0100, Catalin Marinas wrote:
> > > > > On Wed, Jun 16, 2021 at 02:24:39PM -0500, Zhi Li wrote:
> > > > > > On Wed, Jun 16, 2021 at 2:18 PM Frank Li <frank.li@nxp.com> wrote:
> > > > > > > Will Deacon wrote:
> > > > > > > > It would also be helpful to know a bit more about the
> hardware:
> > > > > > > >
> > > > > > > >   - What is the "internal bus fabric"?
> > > > > >
> > > > > > > Look like ARM call as "Interconnect",  Multi AXI master and
> multi
> > > AXI slave
> > > > > > > connected together.
> > > > > >
> > > > > > I  drawed simplified bus structure.
> > > > > >
> > > > > >         ┌──────┐ ┌────┐
> > > > > >         │ A53  │ │A72 │
> > > > > >         └───┬──┘ └─┬──┘
> > > > > >             │      │
> > > > > >         ┌───▼──────▼──┐
> > > > > >         │    CCI400   │
> > > > > >         └─────┬───────┘
> > > > > >               │   1 (a)write to ddr (normal uncached memory)
> > > > > >               │   DMB OSHST
> > > > > >               │   2 (b)write to usb register(device, nGnRE)
> > > > > >         ┌─────▼───────────────────────┐
> ┌
> > > ───────────┐
> > > > > >         │                             ◄───────┤   GPU
> │
> > > > > >         │     Bus fabric              │       │           │
> > > > > >         └────────────────────────────┬┘
> └
> > > ───────────┘
> > > > > > 3 (b) reach usb   ▲ 4 usb read   ▲   │ 6.(a)reach
> > > > > >          │        │   ddr        │   │
> > > > > >       ┌──▼────────┴─┐            │   │
> > > > > >       │             │            │   │
> > > > > >       │  USB        │      5.usb │   │
> > > > > >       │             │      read  │   │
> > > > > >       └─────────────┘            │   │
> > > > > >                                ┌─┴───▼─┐
> > > > > >                                │       │
> > > > > >                                │ DDR   │
> > > > > >                                │       │
> > > > > >                                └───────┘
> > > > >
> > > > > Since you sent an HTML message, it was rejected by the list server.
> The
> > > > > above is a plain-text rendition by w3m (and changed barrier() to
> DMB
> > > > > OSHST).
> > > > >
> > > > > Is the DMB propagated to the bus fabric? IIUC, our logic is that if
> the
> > > > > write (b) to USB is observable by, let's say, the GPU, the same GPU
> > > > > should also observe the write (a) to DDR. Since the write (a) to
> DDR is
> > > > > globally observable, the USB device read at (4) should also observe
> it
> > > > > (well, we may be wrong).
> > > >
> > > > It's pretty rare for barriers to propagate onto the fabric -- usually
> the
> > > > CPU just orders everything based on acknowledgements. If the CCI
> gives
> > > the
> > > > write response for the non-cacheable write I could see that causing
> an
> > > issue
> > > > if the bus fabric can then reorder accesses, but then I would argue
> > > that's a
> > > > broken system because simple ring buffers in non-cacheable memory
> would
> > > fail
> >
> > Bus fabric don't reorder the same axi master.
> >
> > https://elinux.org/images/7/73/Deacon-weak-to-weedy.pdf
> > Page 42 show race condition. I think above race condition happen at our
> system.
> > I am not sure if it is exist at Armv8 system.
> 
> Just a word of warning here, but the Armv8 memory model was
> *retrospectively* strengthened since I gave that talk, so the stuff in that
> pdf is out of date (and wrong).
> 
> > > > for peripherals hooking into the bus fabric (i.e. dma_*mb() would be
> > > > broken). I think it would also mean that DSB doesn't necessarily fix
> the
> > > > issue, it probably just makes it less likely because it takes longer
> to
> > > > get the device write out after the acknowledgement -- ndelay() would
> > > achieve
> > > > the same effect :)
> >
> > That's what I worried.
> >
> > > >
> > > > Frank -- what happens if you try either DMB SY, or DMB OSH (without
> the
> > > ST)
> > > > in writel()?
> >
> > It works well for 2 hours! Normally, problem happen below 10min. So I
> think DMB SY
> > can fix it.
> 
> Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> difference. I'm not quite sure what it means for the coherency of normal,
> non-cacheable accesses (which are outer-shareable) so that probably needs a
> bit more thought.
> 
> Can you confirm that the issue *does* still occur if you use dmb(osh)
> instead of dmb(oshst), please?

After getting support from Arm (https://services.arm.com/support/s/case/5003t00001RuJHw),
we have made some progress on this issue.

Our system configures SYSBARDISABLE = 0x0, so barriers from the Arm cores propagate to
the CCI-400.

Our DMA and USB controllers sit downstream of the CCI-400, so they are in the system
shareability domain. Only with dmb(st) does the CCI-400 wait for previous transactions
to complete. With dmb(osh), the response is sent once snoop responses have been received
for all earlier transactions; the CCI-400 doesn't wait for the previous write to finish.

Best regards
Frank Li

> 
> Will
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 16:11             ` Frank Li
@ 2021-06-21 16:26               ` Will Deacon
  2021-06-21 16:59                 ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-21 16:26 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> > difference. I'm not quite sure what it means for the coherency of normal,
> > non-cacheable accesses (which are outer-shareable) so that probably needs a
> > bit more thought.
> > 
> > Can you confirm that the issue *does* still occur if you use dmb(osh)
> > instead of dmb(oshst), please?
> 
> After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw, 
> This issue have some progress. 
> 
> Our system configure SYSBARDISABLE = 0x0, So ARM core barrier propagate to CCI-400
> 
> Our DMA and USB is located below downstream of CCI-400. So USB or DMA is located
> in system shared domain. Only use dmb(st), CCI-400 wait for previous transaction
> Complete. When dmb(osh), the response is sent when snoop responses are received for
> all earlier transactions. CCI-400 don't wait for previous write finish. 

Thanks for following up. I'll cook a patch to fix this...

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 16:26               ` Will Deacon
@ 2021-06-21 16:59                 ` Will Deacon
  2021-06-21 17:56                   ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-21 16:59 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > Oh, interesting. Maybe this is a case where OSH vs SY actually makes a
> > > difference. I'm not quite sure what it means for the coherency of normal,
> > > non-cacheable accesses (which are outer-shareable) so that probably needs a
> > > bit more thought.
> > > 
> > > Can you confirm that the issue *does* still occur if you use dmb(osh)
> > > instead of dmb(oshst), please?
> > 
> > After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw, 
> > This issue have some progress. 
> > 
> > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier propagate to CCI-400
> > 
> > Our DMA and USB is located below downstream of CCI-400. So USB or DMA is located
> > in system shared domain. Only use dmb(st), CCI-400 wait for previous transaction
> > Complete. When dmb(osh), the response is sent when snoop responses are received for
> > all earlier transactions. CCI-400 don't wait for previous write finish. 
> 
> Thanks for following up. I'll cook a patch to fix this...

... and in doing so, I realised I still have a question about this.

If a CPU is writing to a zero-initialised non-cacheable buffer in memory
and does something like:

	buffer[0] = 1;
	dma_wmb();	// DMB OSHST
	buffer[64] = 1;

would a non-coherent device reading this be able to see buffer[64] == 1
but buffer[0] == 0? In other words, do we need to upgrade the dma_*mb() barriers
as well as the I/O accessors, or are they still ordered by the bus fabric
because all of the accesses are going to the DDR?
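
To make the question concrete, here is a userspace sketch of the same
message-passing pattern using C11 atomics. The names sketch_buffer,
sketch_publish() and sketch_sees_reordering() are invented for illustration,
and the C11 fences only stand in for dma_wmb(); they speak for CPU observers
only and say nothing about what a non-coherent device sees, which is exactly
the open question.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_LEN 128
static_assert(SKETCH_LEN > 64, "index 64 must be in range");

/* Stand-in for the zero-initialised non-cacheable buffer. */
static _Atomic unsigned char sketch_buffer[SKETCH_LEN];

/* Producer: first write, then a write barrier, then the second write. */
static void sketch_publish(void)
{
	atomic_store_explicit(&sketch_buffer[0], 1, memory_order_relaxed);
	/* Plays the role of dma_wmb() in the example above. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&sketch_buffer[64], 1, memory_order_relaxed);
}

/*
 * Observer: returns true for the outcome being asked about, i.e.
 * buffer[64] == 1 while buffer[0] == 0.
 */
static bool sketch_sees_reordering(void)
{
	unsigned char second =
		atomic_load_explicit(&sketch_buffer[64], memory_order_relaxed);
	atomic_thread_fence(memory_order_acquire);
	unsigned char first =
		atomic_load_explicit(&sketch_buffer[0], memory_order_relaxed);

	return second == 1 && first == 0;
}
```

Between CPU threads the C11 model forbids sketch_sees_reordering() from
returning true; whether the same holds for a device behind the fabric is
what needs confirming.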

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 16:59                 ` Will Deacon
@ 2021-06-21 17:56                   ` Frank Li
  2021-06-21 18:13                     ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-21 17:56 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Monday, June 21, 2021 12:00 PM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> > On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > > Oh, interesting. Maybe this is a case where OSH vs SY actually makes
> a
> > > > difference. I'm not quite sure what it means for the coherency of
> normal,
> > > > non-cacheable accesses (which are outer-shareable) so that probably
> needs a
> > > > bit more thought.
> > > >
> > > > Can you confirm that the issue *does* still occur if you use dmb(osh)
> > > > instead of dmb(oshst), please?
> > >
> > > After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw,
> > > This issue have some progress.
> > >
> > > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier propagate
> to CCI-400
> > >
> > > Our DMA and USB is located below downstream of CCI-400. So USB or DMA
> is located
> > > in system shared domain. Only use dmb(st), CCI-400 wait for previous
> transaction
> > > Complete. When dmb(osh), the response is sent when snoop responses are
> received for
> > > all earlier transactions. CCI-400 don't wait for previous write finish.
> >
> > Thanks for following up. I'll cook a patch to fix this...
> 
> ... and in doing so, I realised I still have a question about this.
> 
> If a CPU is writing to a zero-initialised non-cacheable buffer in memory
> and does something like:
> 
>         buffer[0] = 1;
>         dma_wmb();      // DMB OSHST
>         buffer[64] = 1;
> 
> would a non-coherent device reading this be able to see buffer[64] == 1
> but buffer[0] = 0? In other words, do we need to upgrade the dmb_* barriers
> as well as the I/O accessors, or are they still ordered by the bus fabric
> because all of the accesses are going to the DDR?

I think reordering is possible. As I understand it, once the CCI acknowledges
the dmb(oshst), the order of the following writes is not guaranteed for normal
memory unless their addresses overlap.

See A6.6.1 of the AXI protocol spec:

A write W1 must be ordered before a write W2 with the same ID, to the *same Memory location*, where W2
is received after W1 is received.    
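
As a rough illustration of how narrow that rule is, it can be modelled as a
predicate over two writes. This is only a sketch: struct sketch_axi_write and
both helpers are hypothetical names, and treating "same Memory location" as
"overlapping byte ranges" is my assumption rather than wording from the spec.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of one AXI write transaction. */
struct sketch_axi_write {
	uint32_t id;   /* AXI transaction ID */
	uint64_t addr; /* start address */
	uint64_t len;  /* length in bytes, must be non-zero */
};

/* Assumption: "same memory location" means the byte ranges overlap. */
static bool sketch_overlaps(const struct sketch_axi_write *a,
			    const struct sketch_axi_write *b)
{
	assert(a->len > 0 && b->len > 0);
	return a->addr < b->addr + b->len && b->addr < a->addr + a->len;
}

/* True if the rule above forces W1 before W2 (W2 received after W1). */
static bool sketch_axi_must_order(const struct sketch_axi_write *w1,
				  const struct sketch_axi_write *w2)
{
	return w1->id == w2->id && sketch_overlaps(w1, w2);
}
```

Under this model, writes to buffer[0] and buffer[64] with the same ID do not
overlap, so this rule by itself does not order them.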

> 
> Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 17:56                   ` Frank Li
@ 2021-06-21 18:13                     ` Will Deacon
  2021-06-21 21:32                       ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-21 18:13 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Mon, Jun 21, 2021 at 05:56:43PM +0000, Frank Li wrote:
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Monday, June 21, 2021 12:00 PM
> > To: Frank Li <frank.li@nxp.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> > in default I/O accessors
> > 
> > Caution: EXT Email
> > 
> > On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> > > On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > > > Oh, interesting. Maybe this is a case where OSH vs SY actually makes
> > a
> > > > > difference. I'm not quite sure what it means for the coherency of
> > normal,
> > > > > non-cacheable accesses (which are outer-shareable) so that probably
> > needs a
> > > > > bit more thought.
> > > > >
> > > > > Can you confirm that the issue *does* still occur if you use dmb(osh)
> > > > > instead of dmb(oshst), please?
> > > >
> > > > After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw,
> > > > This issue have some progress.
> > > >
> > > > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier propagate
> > to CCI-400
> > > >
> > > > Our DMA and USB is located below downstream of CCI-400. So USB or DMA
> > is located
> > > > in system shared domain. Only use dmb(st), CCI-400 wait for previous
> > transaction
> > > > Complete. When dmb(osh), the response is sent when snoop responses are
> > received for
> > > > all earlier transactions. CCI-400 don't wait for previous write finish.
> > >
> > > Thanks for following up. I'll cook a patch to fix this...
> > 
> > ... and in doing so, I realised I still have a question about this.
> > 
> > If a CPU is writing to a zero-initialised non-cacheable buffer in memory
> > and does something like:
> > 
> >         buffer[0] = 1;
> >         dma_wmb();      // DMB OSHST
> >         buffer[64] = 1;
> > 
> > would a non-coherent device reading this be able to see buffer[64] == 1
> > but buffer[0] = 0? In other words, do we need to upgrade the dmb_* barriers
> > as well as the I/O accessors, or are they still ordered by the bus fabric
> > because all of the accesses are going to the DDR?
> 
> I think re-order is possible. According to my understanding, 
> If cci ack dmb(oshst), the follow order is not guaranteed if no address overlap
> for normal memory. 

Hmm, so that's a bit rubbish because it means that
load-acquire/store-release to non-cacheable memory will *not* create order
for non-coherent devices, as the memory type is outer-shareable :/

So rewriting the above as:

	buffer[0] = 1;
	smp_store_release(&buffer[64], 1);

wouldn't be ordered either.

Can you confirm that it is the case, please?
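
For reference, the userspace C11 counterpart of that second snippet looks
roughly like this. It is a sketch only: sketch_buf and
sketch_publish_release() are made-up names, and a C11 release store only
gives guarantees between CPU observers, not for a device on the fabric.

```c
#include <assert.h>
#include <stdatomic.h>

#define SKETCH_N 128
static_assert(SKETCH_N > 64, "index 64 must be in range");

/* Stand-in for the zero-initialised buffer. */
static _Atomic int sketch_buf[SKETCH_N];

static void sketch_publish_release(void)
{
	/* Plain (relaxed) store, like buffer[0] = 1. */
	atomic_store_explicit(&sketch_buf[0], 1, memory_order_relaxed);
	/* Userspace counterpart of smp_store_release(&buffer[64], 1). */
	atomic_store_explicit(&sketch_buf[64], 1, memory_order_release);
}
```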

Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 18:13                     ` Will Deacon
@ 2021-06-21 21:32                       ` Frank Li
  2021-06-22  9:11                         ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-21 21:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Monday, June 21, 2021 1:13 PM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Mon, Jun 21, 2021 at 05:56:43PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: Monday, June 21, 2021 12:00 PM
> > > To: Frank Li <frank.li@nxp.com>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li
> <lznuaa@gmail.com>;
> > > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin
> Garg
> > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > kernel@lists.infradead.org
> > > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> barriers
> > > in default I/O accessors
> > >
> > > Caution: EXT Email
> > >
> > > On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> > > > On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > > > > Oh, interesting. Maybe this is a case where OSH vs SY actually
> makes
> > > a
> > > > > > difference. I'm not quite sure what it means for the coherency of
> > > normal,
> > > > > > non-cacheable accesses (which are outer-shareable) so that
> probably
> > > needs a
> > > > > > bit more thought.
> > > > > >
> > > > > > Can you confirm that the issue *does* still occur if you use
> dmb(osh)
> > > > > > instead of dmb(oshst), please?
> > > > >
> > > > > After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw,
> > > > > This issue have some progress.
> > > > >
> > > > > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier
> propagate
> > > to CCI-400
> > > > >
> > > > > Our DMA and USB is located below downstream of CCI-400. So USB or
> DMA
> > > is located
> > > > > in system shared domain. Only use dmb(st), CCI-400 wait for
> previous
> > > transaction
> > > > > Complete. When dmb(osh), the response is sent when snoop responses
> are
> > > received for
> > > > > all earlier transactions. CCI-400 don't wait for previous write
> finish.
> > > >
> > > > Thanks for following up. I'll cook a patch to fix this...
> > >
> > > ... and in doing so, I realised I still have a question about this.
> > >
> > > If a CPU is writing to a zero-initialised non-cacheable buffer in
> memory
> > > and does something like:
> > >
> > >         buffer[0] = 1;
> > >         dma_wmb();      // DMB OSHST
> > >         buffer[64] = 1;
> > >
> > > would a non-coherent device reading this be able to see buffer[64] == 1
> > > but buffer[0] = 0? In other words, do we need to upgrade the dmb_*
> barriers
> > > as well as the I/O accessors, or are they still ordered by the bus
> fabric
> > > because all of the accesses are going to the DDR?
> >
> > I think re-order is possible. According to my understanding,
> > If cci ack dmb(oshst), the follow order is not guaranteed if no address
> overlap
> > for normal memory.
> 
> Hmm, so that's a bit rubbish because it means that
> load-acquire/store-release to non-cacheable memory will *not* create order
> for non-coherent devices, as the memory type is outer-shareable :/
> 
> So rewriting the above as:
> 
>         buffer[0] = 1;
>         smp_store_release(&buffer[64], 1);
> 
> wouldn't be ordered either.
> 
> Can you confirm that it is the case, please?

I don't have a test case that can check this directly.
I suppose smp_mb() does not work for a non-coherent DMA master.
If the DMA master needs to see the order, dma_wmb() is needed.

Best regards
Frank Li

> 
> Will

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-21 21:32                       ` Frank Li
@ 2021-06-22  9:11                         ` Will Deacon
  2021-06-23 15:48                           ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-06-22  9:11 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Mon, Jun 21, 2021 at 09:32:22PM +0000, Frank Li wrote:
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Monday, June 21, 2021 1:13 PM
> > To: Frank Li <frank.li@nxp.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> > in default I/O accessors
> > 
> > Caution: EXT Email
> > 
> > On Mon, Jun 21, 2021 at 05:56:43PM +0000, Frank Li wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Will Deacon <will@kernel.org>
> > > > Sent: Monday, June 21, 2021 12:00 PM
> > > > To: Frank Li <frank.li@nxp.com>
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li
> > <lznuaa@gmail.com>;
> > > > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin
> > Garg
> > > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > > kernel@lists.infradead.org
> > > > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> > barriers
> > > > in default I/O accessors
> > > >
> > > > Caution: EXT Email
> > > >
> > > > On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> > > > > On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > > > > > Oh, interesting. Maybe this is a case where OSH vs SY actually
> > makes
> > > > a
> > > > > > > difference. I'm not quite sure what it means for the coherency of
> > > > normal,
> > > > > > > non-cacheable accesses (which are outer-shareable) so that
> > probably
> > > > needs a
> > > > > > > bit more thought.
> > > > > > >
> > > > > > > Can you confirm that the issue *does* still occur if you use
> > dmb(osh)
> > > > > > > instead of dmb(oshst), please?
> > > > > >
> > > > > > After get ARM support https://services.arm.com/support/s/case/5003t00001RuJHw,
> > > > > > This issue have some progress.
> > > > > >
> > > > > > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier
> > propagate
> > > > to CCI-400
> > > > > >
> > > > > > Our DMA and USB is located below downstream of CCI-400. So USB or
> > DMA
> > > > is located
> > > > > > in system shared domain. Only use dmb(st), CCI-400 wait for
> > previous
> > > > transaction
> > > > > > Complete. When dmb(osh), the response is sent when snoop responses
> > are
> > > > received for
> > > > > > all earlier transactions. CCI-400 don't wait for previous write
> > finish.
> > > > >
> > > > > Thanks for following up. I'll cook a patch to fix this...
> > > >
> > > > ... and in doing so, I realised I still have a question about this.
> > > >
> > > > If a CPU is writing to a zero-initialised non-cacheable buffer in
> > memory
> > > > and does something like:
> > > >
> > > >         buffer[0] = 1;
> > > >         dma_wmb();      // DMB OSHST
> > > >         buffer[64] = 1;
> > > >
> > > > would a non-coherent device reading this be able to see buffer[64] == 1
> > > > but buffer[0] = 0? In other words, do we need to upgrade the dmb_*
> > barriers
> > > > as well as the I/O accessors, or are they still ordered by the bus
> > fabric
> > > > because all of the accesses are going to the DDR?
> > >
> > > I think re-order is possible. According to my understanding,
> > > If cci ack dmb(oshst), the follow order is not guaranteed if no address
> > overlap
> > > for normal memory.
> > 
> > Hmm, so that's a bit rubbish because it means that
> > load-acquire/store-release to non-cacheable memory will *not* create order
> > for non-coherent devices, as the memory type is outer-shareable :/
> > 
> > So rewriting the above as:
> > 
> >         buffer[0] = 1;
> >         smp_store_release(&buffer[64], 1);
> > 
> > wouldn't be ordered either.
> > 
> > Can you confirm that it is the case, please?
> 
> I have not test case, which can test it directly. 
> I supposed smp_mb is not work for no-coherent dma master. 
> If want dma master see order, need dma_wmb(). 

I think you had a support case open with Arm [1] which I'm not able to
access -- please can you ask them about the two examples above?

Will

[1] https://services.arm.com/support/s/case/5003t00001RuJHw

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-22  9:11                         ` Will Deacon
@ 2021-06-23 15:48                           ` Frank Li
  2021-07-06 17:11                             ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-23 15:48 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Tuesday, June 22, 2021 4:12 AM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Mon, Jun 21, 2021 at 09:32:22PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: Monday, June 21, 2021 1:13 PM
> > > To: Frank Li <frank.li@nxp.com>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li
> <lznuaa@gmail.com>;
> > > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin
> Garg
> > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > kernel@lists.infradead.org
> > > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> barriers
> > > in default I/O accessors
> > >
> > > Caution: EXT Email
> > >
> > > On Mon, Jun 21, 2021 at 05:56:43PM +0000, Frank Li wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Will Deacon <will@kernel.org>
> > > > > Sent: Monday, June 21, 2021 12:00 PM
> > > > > To: Frank Li <frank.li@nxp.com>
> > > > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li
> > > <lznuaa@gmail.com>;
> > > > > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin
> > > Garg
> > > > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > > > kernel@lists.infradead.org
> > > > > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> > > barriers
> > > > > in default I/O accessors
> > > > >
> > > > > Caution: EXT Email
> > > > >
> > > > > On Mon, Jun 21, 2021 at 05:26:41PM +0100, Will Deacon wrote:
> > > > > > On Mon, Jun 21, 2021 at 04:11:57PM +0000, Frank Li wrote:
> > > > > > > > Oh, interesting. Maybe this is a case where OSH vs SY
> actually
> > > makes
> > > > > a
> > > > > > > > difference. I'm not quite sure what it means for the
> coherency of
> > > > > normal,
> > > > > > > > non-cacheable accesses (which are outer-shareable) so that
> > > probably
> > > > > needs a
> > > > > > > > bit more thought.
> > > > > > > >
> > > > > > > > Can you confirm that the issue *does* still occur if you use
> > > dmb(osh)
> > > > > > > > instead of dmb(oshst), please?
> > > > > > >
> > > > > > > After get ARM support
> > > > >
> > >
> https://eur01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fservices.
> > > > >
> > >
> arm.com%2Fsupport%2Fs%2Fcase%2F5003t00001RuJHw&amp;data=04%7C01%7Cfrank.li%
> > > > >
> > >
> 40nxp.com%7Ca319ac5213a14aa6bb2508d934d5facc%7C686ea1d3bc2b4c6fa92cd99c5c30
> > > > >
> > >
> 1635%7C0%7C0%7C637598915908588560%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwM
> > > > >
> > >
> DAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6%2F%2FK
> > > > > ScsCmnUgNPnzcvyjRrOLjLVPrHtbVgI3J959U%2BQ%3D&amp;reserved=0,
> > > > > > > This issue have some progress.
> > > > > > >
> > > > > > > Our system configure SYSBARDISABLE = 0x0, So ARM core barrier
> > > propagate
> > > > > to CCI-400
> > > > > > >
> > > > > > > Our DMA and USB is located below downstream of CCI-400. So USB
> or
> > > DMA
> > > > > is located
> > > > > > > in system shared domain. Only use dmb(st), CCI-400 wait for
> > > previous
> > > > > transaction
> > > > > > > Complete. When dma(osh), the response is sent when snoop
> responses
> > > are
> > > > > received for
> > > > > > > all earlier transactions. CCI-400 don't wait for previous write
> > > finish.
> > > > > >
> > > > > > Thanks for following up. I'll cook a patch to fix this...
> > > > >
> > > > > ... and in doing so, I realised I still have a question about this.
> > > > >
> > > > > If a CPU is writing to a zero-initialised non-cacheable buffer in
> > > memory
> > > > > and does something like:
> > > > >
> > > > >         buffer[0] = 1;
> > > > >         dma_wmb();      // DMB OSHST
> > > > >         buffer[64] = 1;
> > > > >
> > > > > would a non-coherent device reading this be able to see buffer[64]
> == 1
> > > > > but buffer[0] = 0? In other words, do we need to upgrade the dmb_*
> > > barriers
> > > > > as well as the I/O accessors, or are they still ordered by the bus
> > > fabric
> > > > > because all of the accesses are going to the DDR?
> > > >
> > > > I think re-order is possible. According to my understanding,
> > > > If cci ack dmb(oshst), the follow order is not guaranteed if no
> address
> > > overlap
> > > > for normal memory.
> > >
> > > Hmm, so that's a bit rubbish because it means that
> > > load-acquire/store-release to non-cacheable memory will *not* create
> order
> > > for non-coherent devices, as the memory type is outer-shareable :/
> > >
> > > So rewriting the above as:
> > >
> > >         buffer[0] = 1;
> > >         smp_store_release(&buffer[64], 1);
> > >
> > > wouldn't be ordered either.
> > >
> > > Can you confirm that it is the case, please?
> >
> > I have not test case, which can test it directly.
> > I supposed smp_mb is not work for no-coherent dma master.
> > If want dma master see order, need dma_wmb().
> 
> I think you had a support case open with Arm [1] which I'm not able to
> access -- please can you ask them about the two examples above?

I still have not received feedback from Arm, but I found some information:
https://developer.arm.com/documentation/den0024/a/CHDCJBGA

Unlike the data barrier instructions, which take a qualifier to control
which shareability domains see the effect of the barrier, the LDAR and STLR
instructions use the attribute of the address accessed.

The address attribute is controlled by the page table.

SH0 bits[13:12] Shareability
  00            Non-shareable
  01            UNPREDICTABLE
  10            Outer Shareable
  11            Inner Shareable

#define PTE_SHARED               (_AT(pteval_t, 3) << 8)         /* SH[1:0], inner shareable */

So I think smp_store_release() only acts as a barrier for the
inner-shareable domain.
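
To make the encoding concrete, here is a minimal user-space C sketch (not
kernel code) that extracts the two SH bits from a descriptor value using the
same shift as PTE_SHARED above. The names SH_SHIFT, SH_MASK,
PTE_SHARED_BITS, pte_sh_field(), sh_name() and check_pte_shared() are made
up for this illustration, and the mapping from encoding to name simply
follows the table above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Position and width of the SH[1:0] field, taken from the shift in PTE_SHARED. */
#define SH_SHIFT 8
#define SH_MASK  UINT64_C(3)

/* Same bit pattern as PTE_SHARED, written without the kernel's _AT() helper. */
#define PTE_SHARED_BITS (UINT64_C(3) << SH_SHIFT)

/* Hypothetical helper: extract the raw two-bit shareability field. */
static unsigned int pte_sh_field(uint64_t pte)
{
	return (unsigned int)((pte >> SH_SHIFT) & SH_MASK);
}

/* Hypothetical helper: name an encoding according to the table above. */
static const char *sh_name(unsigned int sh)
{
	switch (sh) {
	case 0:
		return "Non-shareable";
	case 2:
		return "Outer Shareable";
	case 3:
		return "Inner Shareable";
	default:
		return "UNPREDICTABLE";	/* encoding 01 */
	}
}

/* Usage example: the PTE_SHARED bit pattern decodes as inner shareable. */
static void check_pte_shared(void)
{
	assert(strcmp(sh_name(pte_sh_field(PTE_SHARED_BITS)), "Inner Shareable") == 0);
}
```

Calling check_pte_shared() passes silently, which is consistent with the
comment on the PTE_SHARED definition.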

Frank Li

> 
> Will
> 
> [1] https://services.arm.com/support/s/case/5003t00001RuJHw


* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-23 15:48                           ` Frank Li
@ 2021-07-06 17:11                             ` Will Deacon
  2021-07-15 15:53                               ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-07-06 17:11 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

Hi Frank,

On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > I think you had a support case open with Arm [1] which I'm not able to
> > access -- please can you ask them about the two examples above?
> 
> Still not get feedback from ARM.

Just wondering if you were able to solve this without the need to change
Linux?

Cheers,

Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-07-06 17:11                             ` Will Deacon
@ 2021-07-15 15:53                               ` Frank Li
  2021-07-22 19:14                                 ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-07-15 15:53 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Tuesday, July 6, 2021 12:11 PM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> Hi Frank,
> 
> On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > I think you had a support case open with Arm [1] which I'm not able to
> > > access -- please can you ask them about the two examples above?
> >
> > Still not get feedback from ARM.
> 
> Just wondering if you were able to solve this without the need to change
> Linux?

Sorry for the late reply.

For CCI-500 and CCI-550, Arm removed support for barrier transactions, but
CCI-400 still supports them. With CCI-400 it is a valid configuration to
have SYSBARDISABLE LOW in Cortex-A processors. This Linux kernel change
assumes that SYSBARDISABLE is set HIGH, so it is not correct for all
products across the various versions of the Arm CCI IP.

Frank Li

> 
> Cheers,
> 
> Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-07-15 15:53                               ` Frank Li
@ 2021-07-22 19:14                                 ` Frank Li
  2021-08-09 13:50                                   ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-07-22 19:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Frank Li
> Sent: Thursday, July 15, 2021 10:54 AM
> To: Will Deacon <will@kernel.org>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Tuesday, July 6, 2021 12:11 PM
> > To: Frank Li <frank.li@nxp.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> barriers
> > in default I/O accessors
> >
> > Caution: EXT Email
> >
> > Hi Frank,
> >
> > On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > > I think you had a support case open with Arm [1] which I'm not able
> to
> > > > access -- please can you ask them about the two examples above?
> > >
> > > Still not get feedback from ARM.
> >
> > Just wondering if you were able to solve this without the need to change
> > Linux?
> 
> Sorry for late reply
> 
> For CCI-500 and 550, ARM removed support for barrier transactions but CCI-
> 400 supports barrier transactions. With CCI-400 it is a valid configuration
> to have SYSBARDISABLE LOW in Cortex-A processors. This change in Linux
> kernel is assuming that the SYSBARDISABLE is set to HIGH hence its not
> correct change for all products having various versions of ARM CCI IP.
> 
> Frank Li

Deacon:

      Do you plan to fix this problem by changing dma_wmb()?

Frank Li
> 
> >
> > Cheers,
> >
> > Will


* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-07-22 19:14                                 ` Frank Li
@ 2021-08-09 13:50                                   ` Will Deacon
  2021-08-09 14:46                                     ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-08-09 13:50 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Thu, Jul 22, 2021 at 07:14:19PM +0000, Frank Li wrote:
> > > On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > > > I think you had a support case open with Arm [1] which I'm not able
> > to
> > > > > access -- please can you ask them about the two examples above?
> > > >
> > > > Still not get feedback from ARM.
> > >
> > > Just wondering if you were able to solve this without the need to change
> > > Linux?
> > 
> > Sorry for late reply
> > 
> > For CCI-500 and 550, ARM removed support for barrier transactions but CCI-
> > 400 supports barrier transactions. With CCI-400 it is a valid configuration
> > to have SYSBARDISABLE LOW in Cortex-A processors. This change in Linux
> > kernel is assuming that the SYSBARDISABLE is set to HIGH hence its not
> > correct change for all products having various versions of ARM CCI IP.
> > 
> > Frank Li
> 
> Deacon:
> 
>       Did you plan fix this problem by changing dma_wmb()?

No. As far as I understand this problem, you're driving SYSBARDISABLE
'low' yet you have your own bus fabric downstream of the CCI which doesn't
respect barrier transactions. Even if we bodge dma_wmb(), store-release to
non-cacheable memory cannot be made to work on your system as you're
effectively putting some of your non-coherent DMA devices into a separate
outer-shareable domain from the CPUs.

So you have two options:

  1. Drive SYSBARDISABLE 'high' so that the CPU handles ordering for you

- or -

  2. Quirk Linux so that we patch dma_wmb() when we detect your system at
     runtime (so we can extend this in future if we need to emit a different
     sequence for store release)

(1) is definitely the easiest option if it's possible.
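
For illustration only, option (2) could be modelled in user space roughly
as follows. This is a sketch under stated assumptions, not how the kernel
would implement such a quirk: every name here (enum wmb_kind,
default_dma_wmb(), quirk_dma_wmb(), model_dma_wmb, model_apply_quirks()) is
hypothetical, and the C11 fences are only stand-ins for the real
instructions (DMB OSHST for the default dma_wmb(), and a stronger sequence
such as the DSB ST that wmb() emits on arm64).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Which barrier flavour the model executed; illustrative names only. */
enum wmb_kind { WMB_DEFAULT, WMB_QUIRK };

/* Stand-in for the default dma_wmb() (DMB OSHST on arm64). */
static enum wmb_kind default_dma_wmb(void)
{
	atomic_thread_fence(memory_order_release);
	return WMB_DEFAULT;
}

/* Stand-in for a stronger sequence, e.g. what wmb() emits (DSB ST on arm64). */
static enum wmb_kind quirk_dma_wmb(void)
{
	atomic_thread_fence(memory_order_seq_cst);
	return WMB_QUIRK;
}

/* The barrier callers use; it starts out as the default implementation. */
static enum wmb_kind (*model_dma_wmb)(void) = default_dma_wmb;

/*
 * Hypothetical boot-time hook: choose the implementation once, based on
 * whether the affected system was detected.
 */
static void model_apply_quirks(bool fabric_ignores_barriers)
{
	model_dma_wmb = fabric_ignores_barriers ? quirk_dma_wmb : default_dma_wmb;
}
```

The point of the indirection is that only the detected system pays for the
stronger barrier, and the same hook could later select a different sequence
for store-release as well.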

Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-08-09 13:50                                   ` Will Deacon
@ 2021-08-09 14:46                                     ` Frank Li
  2021-08-09 15:26                                       ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-08-09 14:46 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Monday, August 9, 2021 8:51 AM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Thu, Jul 22, 2021 at 07:14:19PM +0000, Frank Li wrote:
> > > > On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > > > > I think you had a support case open with Arm [1] which I'm not
> able
> > > to
> > > > > > access -- please can you ask them about the two examples above?
> > > > >
> > > > > Still not get feedback from ARM.
> > > >
> > > > Just wondering if you were able to solve this without the need to
> change
> > > > Linux?
> > >
> > > Sorry for late reply
> > >
> > > For CCI-500 and 550, ARM removed support for barrier transactions but
> CCI-
> > > 400 supports barrier transactions. With CCI-400 it is a valid
> configuration
> > > to have SYSBARDISABLE LOW in Cortex-A processors. This change in Linux
> > > kernel is assuming that the SYSBARDISABLE is set to HIGH hence its not
> > > correct change for all products having various versions of ARM CCI IP.
> > >
> > > Frank Li
> >
> > Deacon:
> >
> >       Did you plan fix this problem by changing dma_wmb()?
> 
> No. As far as I understand this problem, you're driving SYSBARDISABLE
> 'low' yet you have your own bus fabric downstream of the CCI which doesn't
> respect barrier transactions. Even if we bodge dma_wmb(), store-release to
> non-cacheable memory cannot be made to work on your system as you're
> effectively putting some of your non-coherent DMA devices into a separate
> outer-shareable domain from the CPUs.

Does this mean that Linux expects all DMA devices to be in the
outer-shareable domain rather than the system-shareable domain?

Frank

> 
> So you have two options:
> 
>   1. Drive SYSBARDISABLE 'high' so that the CPU handles ordering for you
> 
> - or -
> 
>   2. Quirk Linux so that we patch dma_wmb() when we detect your system at
>      runtime (so we can extend this in future if we need to emit a
> different
>      sequence for store release)
> 
> (1) is definitely the easiest option if it's possible.
> 
> Will


* Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-08-09 14:46                                     ` Frank Li
@ 2021-08-09 15:26                                       ` Will Deacon
  2021-08-10 18:50                                         ` Frank Li
  0 siblings, 1 reply; 27+ messages in thread
From: Will Deacon @ 2021-08-09 15:26 UTC (permalink / raw)
  To: Frank Li
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel

On Mon, Aug 09, 2021 at 02:46:55PM +0000, Frank Li wrote:
> 
> 
> > -----Original Message-----
> > From: Will Deacon <will@kernel.org>
> > Sent: Monday, August 9, 2021 8:51 AM
> > To: Frank Li <frank.li@nxp.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > kernel@lists.infradead.org
> > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> > in default I/O accessors
> > 
> > Caution: EXT Email
> > 
> > On Thu, Jul 22, 2021 at 07:14:19PM +0000, Frank Li wrote:
> > > > > On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > > > > > I think you had a support case open with Arm [1] which I'm not
> > able
> > > > to
> > > > > > > access -- please can you ask them about the two examples above?
> > > > > >
> > > > > > Still not get feedback from ARM.
> > > > >
> > > > > Just wondering if you were able to solve this without the need to
> > change
> > > > > Linux?
> > > >
> > > > Sorry for late reply
> > > >
> > > > For CCI-500 and 550, ARM removed support for barrier transactions but
> > CCI-
> > > > 400 supports barrier transactions. With CCI-400 it is a valid
> > configuration
> > > > to have SYSBARDISABLE LOW in Cortex-A processors. This change in Linux
> > > > kernel is assuming that the SYSBARDISABLE is set to HIGH hence its not
> > > > correct change for all products having various versions of ARM CCI IP.
> > > >
> > > > Frank Li
> > >
> > > Deacon:
> > >
> > >       Did you plan fix this problem by changing dma_wmb()?
> > 
> > No. As far as I understand this problem, you're driving SYSBARDISABLE
> > 'low' yet you have your own bus fabric downstream of the CCI which doesn't
> > respect barrier transactions. Even if we bodge dma_wmb(), store-release to
> > non-cacheable memory cannot be made to work on your system as you're
> > effectively putting some of your non-coherent DMA devices into a separate
> > outer-shareable domain from the CPUs.
> 
> Does it means the Linux expect all DMA devices in outer-shareable domain instead
> of system shared domain?  

I don't think we've ever documented that and, to be honest, the
outer-shareable domain stuff in the architecture is pretty academic.

However, I think it's fair to say that we do want the acquire/release
instructions to work for non-cacheable buffers when communicating with
non-coherent devices. I _think_ that implies that such devices need to
be in the same outer-shareable domain as the CPUs, although the
architecture isn't really clear here. I can try to find out.
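
As a rough illustration of the pattern in question, here is a portable C11
sketch of release/acquire message passing. It is only an analogue: payload
and flag play the roles of buffer[0] and buffer[64] from the earlier
example, producer_publish() and consumer_poll() are hypothetical names, and
C11 only describes ordering between CPU threads, so it says nothing about
what a non-coherent device behind the interconnect would observe.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;	/* plays the role of buffer[0] */
static atomic_int flag;	/* plays the role of buffer[64] */

/* Producer: plain store to the payload, then a store-release to the flag. */
static void producer_publish(int value)
{
	payload = value;
	atomic_store_explicit(&flag, 1, memory_order_release);
}

/*
 * Consumer: load-acquire the flag. If it is set, the earlier payload store
 * is guaranteed to be visible. Returns 1 and fills *out on success, or 0
 * if nothing has been published yet.
 */
static int consumer_poll(int *out)
{
	if (!atomic_load_explicit(&flag, memory_order_acquire))
		return 0;
	*out = payload;
	return 1;
}
```

Between CPUs, a consumer that sees the flag set must also see the payload.
The open question in this thread is whether a device outside the CPUs'
outer-shareable domain gets the same guarantee.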

Will


* RE: [EXT] Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-08-09 15:26                                       ` Will Deacon
@ 2021-08-10 18:50                                         ` Frank Li
  0 siblings, 0 replies; 27+ messages in thread
From: Frank Li @ 2021-08-10 18:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, Zhi Li, Shenwei Wang, Han Xu, Nitin Garg,
	Jason Liu, linux-arm-kernel



> -----Original Message-----
> From: Will Deacon <will@kernel.org>
> Sent: Monday, August 9, 2021 10:27 AM
> To: Frank Li <frank.li@nxp.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li <lznuaa@gmail.com>;
> Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin Garg
> <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> kernel@lists.infradead.org
> Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit barriers
> in default I/O accessors
> 
> Caution: EXT Email
> 
> On Mon, Aug 09, 2021 at 02:46:55PM +0000, Frank Li wrote:
> >
> >
> > > -----Original Message-----
> > > From: Will Deacon <will@kernel.org>
> > > Sent: Monday, August 9, 2021 8:51 AM
> > > To: Frank Li <frank.li@nxp.com>
> > > Cc: Catalin Marinas <catalin.marinas@arm.com>; Zhi Li
> <lznuaa@gmail.com>;
> > > Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>; Nitin
> Garg
> > > <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-arm-
> > > kernel@lists.infradead.org
> > > Subject: Re: [EXT] Re: The problem about arm64: io: Relax implicit
> barriers
> > > in default I/O accessors
> > >
> > > Caution: EXT Email
> > >
> > > On Thu, Jul 22, 2021 at 07:14:19PM +0000, Frank Li wrote:
> > > > > > On Wed, Jun 23, 2021 at 03:48:10PM +0000, Frank Li wrote:
> > > > > > > > I think you had a support case open with Arm [1] which I'm
> not
> > > able
> > > > > to
> > > > > > > > access -- please can you ask them about the two examples
> above?
> > > > > > >
> > > > > > > Still not get feedback from ARM.
> > > > > >
> > > > > > Just wondering if you were able to solve this without the need to
> > > change
> > > > > > Linux?
> > > > >
> > > > > Sorry for late reply
> > > > >
> > > > > For CCI-500 and 550, ARM removed support for barrier transactions
> but
> > > CCI-
> > > > > 400 supports barrier transactions. With CCI-400 it is a valid
> > > configuration
> > > > > to have SYSBARDISABLE LOW in Cortex-A processors. This change in
> Linux
> > > > > kernel is assuming that the SYSBARDISABLE is set to HIGH hence its
> not
> > > > > correct change for all products having various versions of ARM CCI
> IP.
> > > > >
> > > > > Frank Li
> > > >
> > > > Deacon:
> > > >
> > > >       Did you plan fix this problem by changing dma_wmb()?
> > >
> > > No. As far as I understand this problem, you're driving SYSBARDISABLE
> > > 'low' yet you have your own bus fabric downstream of the CCI which
> doesn't
> > > respect barrier transactions. Even if we bodge dma_wmb(), store-release
> to
> > > non-cacheable memory cannot be made to work on your system as you're
> > > effectively putting some of your non-coherent DMA devices into a
> separate
> > > outer-shareable domain from the CPUs.
> >
> > Does it means the Linux expect all DMA devices in outer-shareable domain
> instead
> > of system shared domain?
> 
> I don't think we've ever documented that and, to be honest, the
> outer-shareable domain stuff in the architecture is pretty academic.
> 
> However, I think it's fair to say that we do want the acquire/release
> instructions to work for non-cacheable buffers when communicating with
> non-coherent devices. I _think_ that implies that such devices need to
> be in the same outer-shareable domain as the CPUs, although the
> architecture isn't really clear here. I can try to find out.

Thanks, if you find anything, let me know.

Frank
 

> 
> Will


* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-16 18:40 ` Catalin Marinas
@ 2021-06-16 18:55   ` Will Deacon
  0 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2021-06-16 18:55 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Frank Li, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
	linux-arm-kernel, Zhi Li

On Wed, Jun 16, 2021 at 07:40:23PM +0100, Catalin Marinas wrote:
> On Mon, Jun 14, 2021 at 10:41:38PM +0000, Frank Li wrote:
> > commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > Author: Will Deacon <will@kernel.org>
> > Date:   Fri Jun 7 15:48:58 2019 +0100
> > 
> >     arm64: io: Relax implicit barriers in default I/O accessors
> > 
> >     The arm64 implementation of the default I/O accessors requires barrier
> >     instructions to satisfy the memory ordering requirements documented in
> >     memory-barriers.txt [1], which are largely derived from the behaviour of
> >     I/O accesses on x86.
> [...]
> > 	If I added wmb() before xhci_ring_ep_doorbell, the problem gone.
> > 	Writel include io_wmb, which map into dma_wmb(). 
> > 	
> > 	1. write ddr
> > 	2. writel
> > 		2a. io_wmb(),   dmb(oshst)
> > 		2b, write usb register
> > 	3. usb dma read ddr.
> > 
> > 	
> > 	Internal bus fabric only guarantee the order for the same AXID.
> > 	1 write ddr may be slow.  USB register get data before 1 because
> > 	GPU occupy ddr now.  So USB DMA start read from ddr and get old
> > 	dma descriptor data and find not ready yet, then missed door
> > 	bell. 
> 
> That's a complex topic, Will should have a better answer. I'll try some
> thought exercise below introducing a hypothetical second CPU.

It would also be helpful to know a bit more about the hardware:

  - What is the "internal bus fabric"?
  - Can you be more specific about the AxIDs? I can't tell how that
    correlates back to code running on the CPU.
  - Is the device cache coherent?
  - What memory types are used to map the DDR and the USB register on the
    CPU? (I got lost in the indirection)

Also, do you know which part of the data appears to be stale when the device
reads it?

Will


* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-14 22:41 Frank Li
  2021-06-16 16:27 ` Frank Li
@ 2021-06-16 18:40 ` Catalin Marinas
  2021-06-16 18:55   ` Will Deacon
  1 sibling, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2021-06-16 18:40 UTC (permalink / raw)
  To: Frank Li
  Cc: Will Deacon, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
	linux-arm-kernel, Zhi Li

On Mon, Jun 14, 2021 at 10:41:38PM +0000, Frank Li wrote:
> commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> Author: Will Deacon <will@kernel.org>
> Date:   Fri Jun 7 15:48:58 2019 +0100
> 
>     arm64: io: Relax implicit barriers in default I/O accessors
> 
>     The arm64 implementation of the default I/O accessors requires barrier
>     instructions to satisfy the memory ordering requirements documented in
>     memory-barriers.txt [1], which are largely derived from the behaviour of
>     I/O accesses on x86.
[...]
> 	If I add wmb() before xhci_ring_ep_doorbell, the problem goes away.
> 	writel() includes io_wmb, which maps to dma_wmb().
> 	
> 	1. write DDR
> 	2. writel
> 		2a. io_wmb(), dmb(oshst)
> 		2b. write USB register
> 	3. USB DMA reads DDR.
> 
> 	
> 	The internal bus fabric only guarantees ordering for the same AXI ID.
> 	The DDR write in step 1 may be slow, so the USB register write can
> 	arrive before it because the GPU is occupying the DDR. The USB DMA
> 	then starts reading from DDR, gets the old DMA descriptor data,
> 	finds it not ready yet, and misses the doorbell.

That's a complex topic, Will should have a better answer. I'll try some
thought exercise below introducing a hypothetical second CPU.

From Will's commit above w.r.t. other-multi-copy atomicity:

      1. A write arriving at an endpoint shared between multiple CPUs is
         visible to all CPUs

      2. A write that is visible to all CPUs is also visible to all other
         observers in the shareability domain

So (1) would be the write to the USB device which is also an observer in
the system (of the DDR writes). (2) refers to the write to the DDR.

If we have CPU0 writing to DDR, followed by DMB and the write to the USB
device, a CPU1 observing the write to the USB device would also observe
the write to DDR (with a DMB between them). Since the USB device is an
observer and the system is multi-copy atomic, the USB should also
observe the CPU0 write to the DDR if CPU1 observed it.

CPU1 can only observe the write to the USB device via an access to that
USB device (e.g. a register read). Such access probably goes through
some serialisation point and the DMB on CPU0 ensures that the prior
write to DDR is visible. Now, a CPU1 read from the USB device cannot
affect the DMA access that the USB device started to the DDR, so we can
take it out of the equation. However, this means that the hardware
should ensure such USB DMA ordering, otherwise it wouldn't be
multi-copy atomic (or at least not match our understanding of it).

Either the hardware doesn't match the memory model or our reasoning is
incorrect (both are possible ;)).

I wonder whether we can look at this in a different way: the USB device
doing a "speculative" access to the DDR before the write to USB is
globally observable. There isn't a way to fix it in the USB device since
it does not observe the write to its register, so we are left with
having to guarantee the completion of the write to the DDR before
informing the USB about it.
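
A toy model of that last point, under loud assumptions: it is a
single-threaded simulation, every name in it (toy_fabric, toy_submit() and
so on) is hypothetical, and the "ordering-only" barrier models the behaviour
Frank reports on this platform rather than what the architecture specifies.
It only shows why waiting for the DDR write to complete before the doorbell
avoids the stale descriptor fetch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model only: all names here are hypothetical. */
struct toy_fabric {
	uint32_t ddr_desc;     /* descriptor value currently in DDR */
	uint32_t pending_desc; /* CPU write still in flight towards DDR */
	bool pending;          /* true while the DDR write has not landed */
	uint32_t usb_seen;     /* what the USB DMA fetched on the doorbell */
};

/* Step 1: the CPU writes the descriptor; DDR is busy, so it is delayed. */
static void toy_cpu_write_ddr(struct toy_fabric *f, uint32_t desc)
{
	f->pending_desc = desc;
	f->pending = true;
}

/* Ordering-only barrier: in this toy model it does not wait for the
 * in-flight DDR write to land (the reported behaviour, an assumption). */
static void toy_order_barrier(struct toy_fabric *f)
{
	(void)f;
}

/* Completion barrier: returns only once the DDR write has landed. */
static void toy_completion_barrier(struct toy_fabric *f)
{
	if (f->pending) {
		f->ddr_desc = f->pending_desc;
		f->pending = false;
	}
}

/* Doorbell: the USB DMA immediately fetches the descriptor from DDR. */
static void toy_ring_doorbell(struct toy_fabric *f)
{
	assert(f);
	f->usb_seen = f->ddr_desc;
}

/* Returns true if the USB DMA fetched the new descriptor. */
static bool toy_submit(struct toy_fabric *f, uint32_t desc,
		       bool wait_for_completion)
{
	toy_cpu_write_ddr(f, desc);
	if (wait_for_completion)
		toy_completion_barrier(f);
	else
		toy_order_barrier(f);
	toy_ring_doorbell(f);
	return f->usb_seen == desc;
}
```

In this model toy_submit() with wait_for_completion set to false fetches
the old descriptor, and with it set to true fetches the new one, which
mirrors the reported difference between dmb(oshst) and an extra wmb().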

-- 
Catalin

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* RE: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-16 16:27 ` Frank Li
@ 2021-06-16 16:29   ` Frank Li
  0 siblings, 0 replies; 27+ messages in thread
From: Frank Li @ 2021-06-16 16:29 UTC (permalink / raw)
  To: Will Deacon, Catalin Marinas
  Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li




> 
> > -----Original Message-----
> > From: Frank Li
> > Sent: Monday, June 14, 2021 5:42 PM
> > To: Will Deacon <will@kernel.org>
> > Cc: Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>;
> > Nitin Garg <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>;
> > linux-arm-kernel@lists.infradead.org; Zhi Li <lznuaa@gmail.com>
> > Subject: The problem about arm64: io: Relax implicit barriers in default I/O
> > accessors
> 
> Added Catalin.
[Frank Li] Sorry, corrected Catalin's address.
> 
> >
> > Will Deacon:
> >
> > 	One of our test cases fails on the 8QM platform (arm64): USB
> > transfers fail when run together with a GPU stress test.
> > 	I found that it is related to your change below.
> >
> > commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > Author: Will Deacon <will@kernel.org>
> > Date:   Fri Jun 7 15:48:58 2019 +0100
> >
> >     arm64: io: Relax implicit barriers in default I/O accessors
> >
> >     The arm64 implementation of the default I/O accessors requires barrier
> >     instructions to satisfy the memory ordering requirements documented in
> >     memory-barriers.txt [1], which are largely derived from the behaviour of
> >     I/O accesses on x86.
> >
> > drivers/usb/host/xhci-ring.c
> >
> > static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
> >                 unsigned int ep_index, unsigned int stream_id, int start_cycle,
> >                 struct xhci_generic_trb *start_trb)
> > {
> >         /*
> >          * Pass all the TRBs to the hardware at once and make sure this write
> >          * isn't reordered.
> >          */
> >         wmb();
> >         if (start_cycle)
> >                 start_trb->field[3] |= cpu_to_le32(start_cycle);
> >         else
> >                 start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
> >         xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
> > }
> >
> > 	If I add wmb() before xhci_ring_ep_doorbell(), the problem goes away.
> > writel() includes io_wmb(), which maps to dma_wmb().
> >
> > 	1. write ddr
> > 	2. writel
> > 		2a. io_wmb(),   dmb(oshst)
> > 		2b. write usb register
> > 	3. usb dma read ddr.
> >
> >
> > 	The internal bus fabric only guarantees ordering for the same AXID.
> > The DDR write in step 1 may be slow, so the USB register write can
> > arrive before write 1 lands because the GPU is occupying DDR.  The USB
> > DMA then starts reading from DDR, gets the old DMA descriptor data,
> > finds it not ready yet, and the doorbell is missed.
> >
> > 	If I ring the doorbell 2-3 times, the problem also goes away.
> >
> > 	So I think dmb(oshst) is not enough for writel().
> >
> >         A writeX() by the CPU to the peripheral will first wait for the
> >         completion of all prior CPU writes to memory. For example, this
> >         ensures that writes by the CPU to an outbound DMA buffer allocated
> >         by dma_alloc_coherent() will be visible to a DMA engine when the
> >         CPU writes to its MMIO control register to trigger the transfer.
> >
> >
> > Best regards
> > Frank Li


* RE: The problem about arm64: io: Relax implicit barriers in default I/O accessors
  2021-06-14 22:41 Frank Li
@ 2021-06-16 16:27 ` Frank Li
  2021-06-16 16:29   ` Frank Li
  2021-06-16 18:40 ` Catalin Marinas
  1 sibling, 1 reply; 27+ messages in thread
From: Frank Li @ 2021-06-16 16:27 UTC (permalink / raw)
  To: Will Deacon, ;catalin.marinas@arm.com
  Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li



> -----Original Message-----
> From: Frank Li
> Sent: Monday, June 14, 2021 5:42 PM
> To: Will Deacon <will@kernel.org>
> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>;
> Nitin Garg <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>;
> linux-arm-kernel@lists.infradead.org; Zhi Li <lznuaa@gmail.com>
> Subject: The problem about arm64: io: Relax implicit barriers in default I/O
> accessors

Added Catalin. 

> 
> Will Deacon:
> 
> 	One of our test cases fails on the 8QM platform (arm64): USB
> transfers fail when run together with a GPU stress test.
> 	I found that it is related to your change below.
> 
> commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> Author: Will Deacon <will@kernel.org>
> Date:   Fri Jun 7 15:48:58 2019 +0100
> 
>     arm64: io: Relax implicit barriers in default I/O accessors
> 
>     The arm64 implementation of the default I/O accessors requires barrier
>     instructions to satisfy the memory ordering requirements documented in
>     memory-barriers.txt [1], which are largely derived from the behaviour of
>     I/O accesses on x86.
> 
> drivers/usb/host/xhci-ring.c
> 
> static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
>                 unsigned int ep_index, unsigned int stream_id, int start_cycle,
>                 struct xhci_generic_trb *start_trb)
> {
>         /*
>          * Pass all the TRBs to the hardware at once and make sure this write
>          * isn't reordered.
>          */
>         wmb();
>         if (start_cycle)
>                 start_trb->field[3] |= cpu_to_le32(start_cycle);
>         else
>                 start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
>         xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
> }
> 
> 	If I add wmb() before xhci_ring_ep_doorbell(), the problem goes away.
> writel() includes io_wmb(), which maps to dma_wmb().
> 
> 	1. write ddr
> 	2. writel
> 		2a. io_wmb(),   dmb(oshst)
> 		2b. write usb register
> 	3. usb dma read ddr.
> 
> 
> 	The internal bus fabric only guarantees ordering for the same AXID.
> The DDR write in step 1 may be slow, so the USB register write can
> arrive before write 1 lands because the GPU is occupying DDR.  The USB
> DMA then starts reading from DDR, gets the old DMA descriptor data,
> finds it not ready yet, and the doorbell is missed.
> 
> 	If I ring the doorbell 2-3 times, the problem also goes away.
> 
> 	So I think dmb(oshst) is not enough for writel().
> 
>         A writeX() by the CPU to the peripheral will first wait for the
>         completion of all prior CPU writes to memory. For example, this
>         ensures that writes by the CPU to an outbound DMA buffer allocated
>         by dma_alloc_coherent() will be visible to a DMA engine when the
>         CPU writes to its MMIO control register to trigger the transfer.
> 
> 
> Best regards
> Frank Li

* The problem about arm64: io: Relax implicit barriers in default I/O accessors
@ 2021-06-14 22:41 Frank Li
  2021-06-16 16:27 ` Frank Li
  2021-06-16 18:40 ` Catalin Marinas
  0 siblings, 2 replies; 27+ messages in thread
From: Frank Li @ 2021-06-14 22:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li

Will Deacon:

	One of our test cases fails on the 8QM platform (arm64): USB transfers fail when run together with a GPU stress test.
	I found that it is related to your change below.
	
commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
Author: Will Deacon <will@kernel.org>
Date:   Fri Jun 7 15:48:58 2019 +0100

    arm64: io: Relax implicit barriers in default I/O accessors

    The arm64 implementation of the default I/O accessors requires barrier
    instructions to satisfy the memory ordering requirements documented in
    memory-barriers.txt [1], which are largely derived from the behaviour of
    I/O accesses on x86.
 
drivers/usb/host/xhci-ring.c

static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
                unsigned int ep_index, unsigned int stream_id, int start_cycle,
                struct xhci_generic_trb *start_trb)
{
        /*
         * Pass all the TRBs to the hardware at once and make sure this write
         * isn't reordered.
         */
        wmb();
        if (start_cycle)
                start_trb->field[3] |= cpu_to_le32(start_cycle);
        else
                start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
        xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
}

	If I add wmb() before xhci_ring_ep_doorbell(), the problem goes away.  writel() includes io_wmb(), which maps to dma_wmb().
	
	1. write ddr
	2. writel
		2a. io_wmb(),   dmb(oshst)
		2b. write usb register
	3. usb dma read ddr.

	
	The internal bus fabric only guarantees ordering for the same AXID.  The DDR write in step 1 may be slow, so the USB register write can arrive before write 1 lands because the GPU is occupying DDR.  The USB DMA then starts reading from DDR, gets the old DMA descriptor data, finds it not ready yet, and the doorbell is missed.

	If I ring the doorbell 2-3 times, the problem also goes away.

	So I think dmb(oshst) is not enough for writel().

       A writeX() by the CPU to the peripheral will first wait for the
        completion of all prior CPU writes to memory. For example, this ensures
        that writes by the CPU to an outbound DMA buffer allocated by
        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
        to its MMIO control register to trigger the transfer.

	
Best regards
Frank Li


end of thread, other threads:[~2021-08-10 18:52 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <AS8PR04MB850004639EE6CE9432BBF13E880F9@AS8PR04MB8500.eurprd04.prod.outlook.com>
     [not found] ` <CAHrpEqRsp2_bt=p5JgS5F-2F_LCwgT+VX7mSENzpEYTQiW1tjg@mail.gmail.com>
2021-06-17  9:27   ` The problem about arm64: io: Relax implicit barriers in default I/O accessors Catalin Marinas
2021-06-17 17:25     ` Will Deacon
2021-06-17 17:41       ` Will Deacon
2021-06-17 20:11         ` [EXT] " Frank Li
2021-06-17 21:40           ` Will Deacon
2021-06-17 22:13             ` Frank Li
2021-06-18 14:56             ` Nitin Garg
2021-06-21 16:11             ` Frank Li
2021-06-21 16:26               ` Will Deacon
2021-06-21 16:59                 ` Will Deacon
2021-06-21 17:56                   ` Frank Li
2021-06-21 18:13                     ` Will Deacon
2021-06-21 21:32                       ` Frank Li
2021-06-22  9:11                         ` Will Deacon
2021-06-23 15:48                           ` Frank Li
2021-07-06 17:11                             ` Will Deacon
2021-07-15 15:53                               ` Frank Li
2021-07-22 19:14                                 ` Frank Li
2021-08-09 13:50                                   ` Will Deacon
2021-08-09 14:46                                     ` Frank Li
2021-08-09 15:26                                       ` Will Deacon
2021-08-10 18:50                                         ` Frank Li
2021-06-14 22:41 Frank Li
2021-06-16 16:27 ` Frank Li
2021-06-16 16:29   ` Frank Li
2021-06-16 18:40 ` Catalin Marinas
2021-06-16 18:55   ` Will Deacon
