* The problem about arm64: io: Relax implicit barriers in default I/O accessors
@ 2021-06-14 22:41 Frank Li
2021-06-16 16:27 ` Frank Li
2021-06-16 18:40 ` Catalin Marinas
0 siblings, 2 replies; 8+ messages in thread
From: Frank Li @ 2021-06-14 22:41 UTC (permalink / raw)
To: Will Deacon
Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li
Will Deacon:
Our a test case is failure at 8QM platform(arm64). USB transfer failure if run with GPU stress test.
I found it related with your below change.
commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
Author: Will Deacon <will@kernel.org>
Date: Fri Jun 7 15:48:58 2019 +0100
arm64: io: Relax implicit barriers in default I/O accessors
The arm64 implementation of the default I/O accessors requires barrier
instructions to satisfy the memory ordering requirements documented in
memory-barriers.txt [1], which are largely derived from the behaviour of
I/O accesses on x86.
drivers/usb/host/xhci-ring.c
static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
unsigned int ep_index, unsigned int stream_id, int start_cycle,
struct xhci_generic_trb *start_trb)
{
/*
* Pass all the TRBs to the hardware at once and make sure this write
* isn't reordered.
*/
wmb();
if (start_cycle)
start_trb->field[3] |= cpu_to_le32(start_cycle);
else
start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
}
If I added wmb() before xhci_ring_ep_doorbell, the problem gone. Writel include io_wmb, which map into dma_wmb().
1. write ddr
2. writel
2a. io_wmb(), dmb(oshst)
2b, write usb register
3. usb dma read ddr.
Internal bus fabric only guarantee the order for the same AXID. 1 write ddr may be slow. USB register get data before 1 because GPU occupy ddr now. So USB DMA start read from ddr and get old dma descriptor data and find not ready yet, then missed door bell.
If do 2-3 times doorbell, problem also gone.
So I think dmb(oshst) is not enough for writel.
A writeX() by the CPU to the peripheral will first wait for the
completion of all prior CPU writes to memory. For example, this ensures
that writes by the CPU to an outbound DMA buffer allocated by
dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
to its MMIO control register to trigger the transfer.
Best regards
Frank Li
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: The problem about arm64: io: Relax implicit barriers in default I/O accessors
2021-06-14 22:41 The problem about arm64: io: Relax implicit barriers in default I/O accessors Frank Li
@ 2021-06-16 16:27 ` Frank Li
2021-06-16 16:29 ` Frank Li
2021-06-16 18:40 ` Catalin Marinas
1 sibling, 1 reply; 8+ messages in thread
From: Frank Li @ 2021-06-16 16:27 UTC (permalink / raw)
To: Will Deacon, ;catalin.marinas
Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li
> -----Original Message-----
> From: Frank Li
> Sent: Monday, June 14, 2021 5:42 PM
> To: Will Deacon <will@kernel.org>
> Cc: Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>;
> Nitin Garg <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>; linux-
> arm-kernel@lists.infradead.org; Zhi Li <lznuaa@gmail.com>
> Subject: The problem about arm64: io: Relax implicit barriers in default I/O
> accessors
Added Catalin.
>
> Will Deacon:
>
> Our a test case is failure at 8QM platform(arm64). USB transfer
> failure if run with GPU stress test.
> I found it related with your below change.
>
> commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> Author: Will Deacon <will@kernel.org>
> Date: Fri Jun 7 15:48:58 2019 +0100
>
> arm64: io: Relax implicit barriers in default I/O accessors
>
> The arm64 implementation of the default I/O accessors requires barrier
> instructions to satisfy the memory ordering requirements documented in
> memory-barriers.txt [1], which are largely derived from the behaviour of
> I/O accesses on x86.
>
> drivers/usb/host/xhci-ring.c
>
> static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
> unsigned int ep_index, unsigned int stream_id, int start_cycle,
> struct xhci_generic_trb *start_trb)
> {
> /*
> * Pass all the TRBs to the hardware at once and make sure this write
> * isn't reordered.
> */
> wmb();
> if (start_cycle)
> start_trb->field[3] |= cpu_to_le32(start_cycle);
> else
> start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
> xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
> }
>
> If I added wmb() before xhci_ring_ep_doorbell, the problem gone.
> Writel include io_wmb, which map into dma_wmb().
>
> 1. write ddr
> 2. writel
> 2a. io_wmb(), dmb(oshst)
> 2b, write usb register
> 3. usb dma read ddr.
>
>
> Internal bus fabric only guarantee the order for the same AXID. 1
> write ddr may be slow. USB register get data before 1 because GPU occupy
> ddr now. So USB DMA start read from ddr and get old dma descriptor data
> and find not ready yet, then missed door bell.
>
> If do 2-3 times doorbell, problem also gone.
>
> So I think dmb(oshst) is not enough for writel.
>
> A writeX() by the CPU to the peripheral will first wait for the
> completion of all prior CPU writes to memory. For example, this ensures
> that writes by the CPU to an outbound DMA buffer allocated by
> dma_alloc_coherent() will be visible to a DMA engine when the CPU
> writes
> to its MMIO control register to trigger the transfer.
>
>
> Best regards
> Frank Li
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread
* RE: The problem about arm64: io: Relax implicit barriers in default I/O accessors
2021-06-16 16:27 ` Frank Li
@ 2021-06-16 16:29 ` Frank Li
0 siblings, 0 replies; 8+ messages in thread
From: Frank Li @ 2021-06-16 16:29 UTC (permalink / raw)
To: Will Deacon, Catalin Marinas
Cc: Shenwei Wang, Han Xu, Nitin Garg, Jason Liu, linux-arm-kernel, Zhi Li
>
> > -----Original Message-----
> > From: Frank Li
> > Sent: Monday, June 14, 2021 5:42 PM
> > To: Will Deacon <will@kernel.org>
> > Cc: Shenwei Wang <shenwei.wang@nxp.com>; Han Xu <han.xu@nxp.com>;
> > Nitin Garg <nitin.garg@nxp.com>; Jason Liu <jason.hui.liu@nxp.com>;
> linux-
> > arm-kernel@lists.infradead.org; Zhi Li <lznuaa@gmail.com>
> > Subject: The problem about arm64: io: Relax implicit barriers in default I/O
> > accessors
>
> Added Catalin.
[Frank Li] sorry, corrected catalin's address
>
> >
> > Will Deacon:
> >
> > Our a test case is failure at 8QM platform(arm64). USB transfer
> > failure if run with GPU stress test.
> > I found it related with your below change.
> >
> > commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > Author: Will Deacon <will@kernel.org>
> > Date: Fri Jun 7 15:48:58 2019 +0100
> >
> > arm64: io: Relax implicit barriers in default I/O accessors
> >
> > The arm64 implementation of the default I/O accessors requires barrier
> > instructions to satisfy the memory ordering requirements documented in
> > memory-barriers.txt [1], which are largely derived from the behaviour of
> > I/O accesses on x86.
> >
> > drivers/usb/host/xhci-ring.c
> >
> > static void giveback_first_trb(struct xhci_hcd *xhci, int slot_id,
> > unsigned int ep_index, unsigned int stream_id, int start_cycle,
> > struct xhci_generic_trb *start_trb)
> > {
> > /*
> > * Pass all the TRBs to the hardware at once and make sure this write
> > * isn't reordered.
> > */
> > wmb();
> > if (start_cycle)
> > start_trb->field[3] |= cpu_to_le32(start_cycle);
> > else
> > start_trb->field[3] &= cpu_to_le32(~TRB_CYCLE);
> > xhci_ring_ep_doorbell(xhci, slot_id, ep_index, stream_id);
> > }
> >
> > If I added wmb() before xhci_ring_ep_doorbell, the problem gone.
> > Writel include io_wmb, which map into dma_wmb().
> >
> > 1. write ddr
> > 2. writel
> > 2a. io_wmb(), dmb(oshst)
> > 2b, write usb register
> > 3. usb dma read ddr.
> >
> >
> > Internal bus fabric only guarantee the order for the same AXID. 1
> > write ddr may be slow. USB register get data before 1 because GPU occupy
> > ddr now. So USB DMA start read from ddr and get old dma descriptor data
> > and find not ready yet, then missed door bell.
> >
> > If do 2-3 times doorbell, problem also gone.
> >
> > So I think dmb(oshst) is not enough for writel.
> >
> > A writeX() by the CPU to the peripheral will first wait for the
> > completion of all prior CPU writes to memory. For example, this
> ensures
> > that writes by the CPU to an outbound DMA buffer allocated by
> > dma_alloc_coherent() will be visible to a DMA engine when the CPU
> > writes
> > to its MMIO control register to trigger the transfer.
> >
> >
> > Best regards
> > Frank Li
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
2021-06-14 22:41 The problem about arm64: io: Relax implicit barriers in default I/O accessors Frank Li
2021-06-16 16:27 ` Frank Li
@ 2021-06-16 18:40 ` Catalin Marinas
2021-06-16 18:55 ` Will Deacon
1 sibling, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2021-06-16 18:40 UTC (permalink / raw)
To: Frank Li
Cc: Will Deacon, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
linux-arm-kernel, Zhi Li
On Mon, Jun 14, 2021 at 10:41:38PM +0000, Frank Li wrote:
> commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> Author: Will Deacon <will@kernel.org>
> Date: Fri Jun 7 15:48:58 2019 +0100
>
> arm64: io: Relax implicit barriers in default I/O accessors
>
> The arm64 implementation of the default I/O accessors requires barrier
> instructions to satisfy the memory ordering requirements documented in
> memory-barriers.txt [1], which are largely derived from the behaviour of
> I/O accesses on x86.
[...]
> If I added wmb() before xhci_ring_ep_doorbell, the problem gone.
> Writel include io_wmb, which map into dma_wmb().
>
> 1. write ddr
> 2. writel
> 2a. io_wmb(), dmb(oshst)
> 2b, write usb register
> 3. usb dma read ddr.
>
>
> Internal bus fabric only guarantee the order for the same AXID.
> 1 write ddr may be slow. USB register get data before 1 because
> GPU occupy ddr now. So USB DMA start read from ddr and get old
> dma descriptor data and find not ready yet, then missed door
> bell.
That's a complex topic, Will should have a better answer. I'll try some
thought exercise below introducing a hypothetical second CPU.
From Will's commit above w.r.t. other-multi-copy atomicity:
1. A write arriving at an endpoint shared between multiple CPUs is
visible to all CPUs
2. A write that is visible to all CPUs is also visible to all other
observers in the shareability domain
So (1) would be the write to the USB device which is also an observer in
the system (of the DDR writes). (2) refers to the write to the DDR.
If we have CPU0 writing to DDR, followed by DMB and the write to the USB
device, a CPU1 observing the write to the USB device would also observe
the write to DDR (with a DMB between them). Since the USB device is an
observer and the system is multi-copy atomic, the USB should also
observe the CPU0 write to the DDR if CPU1 observed it.
CPU1 can only observe the write to the USB device via an access to that
USB device (e.g. a register read). Such access probably goes through
some serialisation point and the DMB on CPU0 ensures that the prior
write to DDR is visible. Now, a CPU1 read from the USB device cannot
affect the DMA access that the USB device started to the DDR, so we can
take it out of the equation. However, this means that the hardware
should ensure such ordering USB DMA ordering otherwise it wouldn't be
multi-copy atomic (or our understanding of it).
Either the hardware doesn't match the memory model or our reasoning is
incorrect (both are possible ;)).
I wonder whether we can look at this in a different way: the USB device
doing a "speculative" access to the DDR before the write to USB is
globally observable. There isn't a way to fix it in the USB device since
it does not observe the write to its register, so we are left with
having to guarantee the completion of the write to the DDR before
informing the USB about it.
--
Catalin
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: The problem about arm64: io: Relax implicit barriers in default I/O accessors
2021-06-16 18:40 ` Catalin Marinas
@ 2021-06-16 18:55 ` Will Deacon
0 siblings, 0 replies; 8+ messages in thread
From: Will Deacon @ 2021-06-16 18:55 UTC (permalink / raw)
To: Catalin Marinas
Cc: Frank Li, Shenwei Wang, Han Xu, Nitin Garg, Jason Liu,
linux-arm-kernel, Zhi Li
On Wed, Jun 16, 2021 at 07:40:23PM +0100, Catalin Marinas wrote:
> On Mon, Jun 14, 2021 at 10:41:38PM +0000, Frank Li wrote:
> > commit 22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > Author: Will Deacon <will@kernel.org>
> > Date: Fri Jun 7 15:48:58 2019 +0100
> >
> > arm64: io: Relax implicit barriers in default I/O accessors
> >
> > The arm64 implementation of the default I/O accessors requires barrier
> > instructions to satisfy the memory ordering requirements documented in
> > memory-barriers.txt [1], which are largely derived from the behaviour of
> > I/O accesses on x86.
> [...]
> > If I added wmb() before xhci_ring_ep_doorbell, the problem gone.
> > Writel include io_wmb, which map into dma_wmb().
> >
> > 1. write ddr
> > 2. writel
> > 2a. io_wmb(), dmb(oshst)
> > 2b, write usb register
> > 3. usb dma read ddr.
> >
> >
> > Internal bus fabric only guarantee the order for the same AXID.
> > 1 write ddr may be slow. USB register get data before 1 because
> > GPU occupy ddr now. So USB DMA start read from ddr and get old
> > dma descriptor data and find not ready yet, then missed door
> > bell.
>
> That's a complex topic, Will should have a better answer. I'll try some
> thought exercise below introducing a hypothetical second CPU.
It would also be helpful to know a bit more about the hardware:
- What is the "internal bus fabric"?
- Can you be more specific about the AxIDs? I can't tell how that
correlates back to code running on the CPU.
- Is the device cache coherent?
- What memory types are used to map the DDR and the USB register on the
CPU? (I got lost in the indirection)
Also, do you know which part of the data appears to be stale when the device
reads it?
Will
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread
[parent not found: <AS8PR04MB850004639EE6CE9432BBF13E880F9@AS8PR04MB8500.eurprd04.prod.outlook.com>]
end of thread, other threads:[~2021-06-17 17:43 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-14 22:41 The problem about arm64: io: Relax implicit barriers in default I/O accessors Frank Li
2021-06-16 16:27 ` Frank Li
2021-06-16 16:29 ` Frank Li
2021-06-16 18:40 ` Catalin Marinas
2021-06-16 18:55 ` Will Deacon
[not found] <AS8PR04MB850004639EE6CE9432BBF13E880F9@AS8PR04MB8500.eurprd04.prod.outlook.com>
[not found] ` <CAHrpEqRsp2_bt=p5JgS5F-2F_LCwgT+VX7mSENzpEYTQiW1tjg@mail.gmail.com>
2021-06-17 9:27 ` Catalin Marinas
2021-06-17 17:25 ` Will Deacon
2021-06-17 17:41 ` Will Deacon
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.