linux-arm-kernel.lists.infradead.org archive mirror
* Should we use "dsb" or "dmb" between write to buffer and write to register
@ 2022-08-22  7:53 Mark Zhang
  2022-09-07 17:53 ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Zhang @ 2022-08-22  7:53 UTC
  To: catalin.marinas, will, linux-arm-kernel
  Cc: Yishai Hadas, Jason Gunthorpe, Maor Gottlieb, Leon Romanovsky,
	Michael Guralnik, Michael Berezin, yong.xu, Eran Ben Elisha

Hi ARM experts,

May I ask when to use dsb and when to use dmb in our device driver? Thanks.

For example, when sending a command to FW/HW, we usually do it in 3 steps:
   1. memcpy(buff, src, size);
   2. wmb();
   3. write64(ctrl, reg_addr);

IIUC, wmb() in the kernel is implemented with "dsb st". When we change
this to "dmb st" we get better performance, but we are not sure whether
it is safe. I have read Will's post [1] but am still not sure.
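
For reference, the three steps above can be sketched in portable C roughly
as follows. This is only a userspace illustration under stated assumptions:
`struct fake_dev` and `post_command()` are hypothetical names, the doorbell
is an ordinary volatile word rather than a real MMIO mapping, and the C11
release fence merely marks the place where wmb() sits. A C11 fence is not a
substitute for a real barrier in front of an MMIO store.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device: a command buffer in normal memory plus a
 * doorbell word standing in for the MMIO control register. */
struct fake_dev {
	uint8_t buff[64];
	volatile uint64_t ctrl;
};

/* Fill the buffer, order the stores, then ring the doorbell. */
static void post_command(struct fake_dev *dev, const void *src, size_t size,
			 uint64_t ctrl)
{
	assert(size <= sizeof(dev->buff));

	memcpy(dev->buff, src, size);			/* step 1 */
	atomic_thread_fence(memory_order_release);	/* step 2: stand-in for wmb() */
	dev->ctrl = ctrl;				/* step 3 */
}
```

The question in this mail is only about which instruction step 2 has to
become on arm64; steps 1 and 3 stay the same.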

So our questions are:
1. Can we use "dmb" here?
2. If we can, should we use "dmb st" or "dmb oshst"?

Thank you very much.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-08-22  7:53 Should we use "dsb" or "dmb" between write to buffer and write to register Mark Zhang
@ 2022-09-07 17:53 ` Catalin Marinas
  2022-09-08 13:50   ` Will Deacon
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2022-09-07 17:53 UTC
  To: Mark Zhang
  Cc: will, linux-arm-kernel, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Leon Romanovsky, Michael Guralnik,
	Michael Berezin, yong.xu, Eran Ben Elisha

On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
> May I consult when to use dsb or dmb in our device driver, thanks:
> 
> For example when send a command a FW/HW, usually we do it with 3 steps:
>   1. memcpy(buff, src, size);
>   2. wmb();
>   3. write64(ctrl, reg_addr);
> 
> IIUC in kernel wmb() is implemented with "dsb st". When we change this to
> "dmb st" then we get better performance, but we are not sure if it's safe. I
> have read Will's post[1] but still not sure.
> 
> So our questions are:
> 1. can we use "dmb" here?
> 2. If we can then should we use "dmb st", or "dmb oshst"?
> 
> Thank you very much.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f

Will convinced me at the time that it's sufficient, though every time I
revisit this I get confused ;). Not sure whether we have updated the
memory model since to cover such scenarios. In practice, at least from
what I recall, that should be safe.

IIRC, the logic is that if an observer in the same shareability domain
is seeing the write64 (3), it should have observed the memcpy (1) as
well given the DMB. The device in question is one of the observers
observing the memcpy to 'buff' (but it doesn't 'observe' the write64
itself). In a multi-copy atomic world, if a third observer is seeing the
write64 and therefore the memcpy, it means that the device should have
observed the memcpy as well (the multi-copy atomicity requirement).

That's where it looks a bit like Schrodinger's cat to me (the state of
the cat being whether the device observed the memcpy or not). You can't
be sure until you have a third observer seeing the write64 to device. In
the absence of such hypothetical observer, the device might or might not
have seen the new data in 'buff' since it cannot observe the write64 to
its control register (and from the commit log, this seems to be the case
with peripherals private to a CPU).

I guess the question is what does it mean for the device that a third
observer saw the write64. In one interpretation of observability,
another write64 from the third observer is ordered after the original
write64 but to me it still doesn't help clarify any order imposed on the
device access to 'buff':

Initial state:
  buff=0
  ctrl=0

P0:		P1:		Device:
  Wbuff=1	  Wctrl=2	  Ry=buff
  DMB		  DMB
  Wctrl=1	  Rx=buff

If the final 'ctrl' register value is 2 then x==1. But I don't see how
y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
including the device, should see the buff value of 1 but this assumes
that there is some other ordering for when Ry=buff is issued.
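
The litmus test above can be explored with a small enumerator. This is a
sketch under a deliberately simplified model, not the Arm memory model: each
DMB is assumed to keep the two accesses on its CPU in program order, every
access takes effect atomically, and the device read Ry=buff may happen at any
point. The names `explore()` and `run_litmus()` are made up for this
illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Shared locations plus the values returned by the two reads. */
struct state {
	int buff, ctrl;	/* memory location and device register */
	int x, y;	/* P1's read and the device's read of buff */
};

/* What was seen among the executions that end with ctrl == 2. */
struct summary {
	int executions;		/* total interleavings explored */
	bool x_is_always_1;
	bool y0_seen;
	bool y1_seen;
};

/* Try every interleaving; pc0/pc1/pcd count the accesses already done. */
static void explore(struct state s, int pc0, int pc1, int pcd,
		    struct summary *out)
{
	assert(pc0 <= 2 && pc1 <= 2 && pcd <= 1);

	if (pc0 == 2 && pc1 == 2 && pcd == 1) {
		out->executions++;
		if (s.ctrl == 2) {
			if (s.x != 1)
				out->x_is_always_1 = false;
			if (s.y == 0)
				out->y0_seen = true;
			if (s.y == 1)
				out->y1_seen = true;
		}
		return;
	}
	if (pc0 < 2) {		/* P0: Wbuff=1; DMB; Wctrl=1 */
		struct state n = s;
		if (pc0 == 0)
			n.buff = 1;
		else
			n.ctrl = 1;
		explore(n, pc0 + 1, pc1, pcd, out);
	}
	if (pc1 < 2) {		/* P1: Wctrl=2; DMB; Rx=buff */
		struct state n = s;
		if (pc1 == 0)
			n.ctrl = 2;
		else
			n.x = n.buff;
		explore(n, pc0, pc1 + 1, pcd, out);
	}
	if (pcd < 1) {		/* Device: Ry=buff, unordered against the CPUs */
		struct state n = s;
		n.y = n.buff;
		explore(n, pc0, pc1, pcd + 1, out);
	}
}

static struct summary run_litmus(void)
{
	struct state init = { .buff = 0, .ctrl = 0, .x = -1, .y = -1 };
	struct summary out = { .executions = 0, .x_is_always_1 = true,
			       .y0_seen = false, .y1_seen = false };

	explore(init, 0, 0, 0, &out);
	return out;
}
```

Under these assumptions, every execution that ends with ctrl==2 has x==1,
while both y==0 and y==1 remain reachable, which is the gap described above:
nothing in the test ties Ry=buff to either write of ctrl.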

So, as you can see, I'm even more confused than when I started writing
this email ;). I'd leave this to Will to explain and, of course, if your
hardware folks disagree, they should let us know.

-- 
Catalin


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-07 17:53 ` Catalin Marinas
@ 2022-09-08 13:50   ` Will Deacon
  2022-09-12  8:43     ` Mark Zhang
  2022-09-12 18:16     ` Catalin Marinas
  0 siblings, 2 replies; 8+ messages in thread
From: Will Deacon @ 2022-09-08 13:50 UTC
  To: Catalin Marinas
  Cc: Mark Zhang, linux-arm-kernel, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Leon Romanovsky, Michael Guralnik,
	Michael Berezin, yong.xu, Eran Ben Elisha

On Wed, Sep 07, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
> > May I consult when to use dsb or dmb in our device driver, thanks:
> > 
> > For example when send a command a FW/HW, usually we do it with 3 steps:
> >   1. memcpy(buff, src, size);
> >   2. wmb();
> >   3. write64(ctrl, reg_addr);

I'm assuming that write64 is just a plain 64-bit store to a device mapping
and doesn't imply any further ordering.

> > IIUC in kernel wmb() is implemented with "dsb st". When we change this to
> > "dmb st" then we get better performance, but we are not sure if it's safe. I
> > have read Will's post[1] but still not sure.
> > 
> > So our questions are:
> > 1. can we use "dmb" here?
> > 2. If we can then should we use "dmb st", or "dmb oshst"?
> > 
> > Thank you very much.
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> 
> Will convinced me at the time that it's sufficient, though every time I
> revisit this I get confused ;). Not sure whether we have updated the
> memory model since to cover such scenarios. In practice at least from
> what I recall that should be safe.

The Armv8 memory model is "other-multi-copy-atomic" which means that a
store is either visible _only_ to the observer from which it originates
or it is visible to all observers. It cannot exist in some intermediate
state.

With that, the insight is that a write to the MMIO interface of a shared
peripheral must be observed by all observers when it reaches the endpoint.
Consequently, we only need to ensure that the stores from your memcpy()
in the motivating example are observed before the MMIO write is observed
and a DMB ST is sufficient for that. We use OSHST in Linux in case the
memory buffer is mapped as non-cacheable but I'm doubtful whether it makes
a real practical difference.
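
To make the three barrier flavours discussed here concrete, they could be
wrapped in C along the following lines. This is a hedged sketch: the
`barrier_*()` macros, `enum barrier_kind` and `publish()` are hypothetical
names, the inline assembly is only emitted when building for AArch64, and the
C11 fence in the fallback branch exists purely so the sketch compiles on
other hosts (it is not equivalent to any of the three instructions).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#if defined(__aarch64__)
/* Real AArch64 barrier instructions. */
#define barrier_dsb_st()	__asm__ volatile("dsb st" ::: "memory")
#define barrier_dmb_st()	__asm__ volatile("dmb st" ::: "memory")
#define barrier_dmb_oshst()	__asm__ volatile("dmb oshst" ::: "memory")
#else
/* Build-only fallback for other hosts; NOT equivalent to the above. */
#define barrier_dsb_st()	atomic_thread_fence(memory_order_seq_cst)
#define barrier_dmb_st()	atomic_thread_fence(memory_order_seq_cst)
#define barrier_dmb_oshst()	atomic_thread_fence(memory_order_seq_cst)
#endif

enum barrier_kind { USE_DSB_ST, USE_DMB_ST, USE_DMB_OSHST };

static void store_barrier(enum barrier_kind kind)
{
	switch (kind) {
	case USE_DSB_ST:
		barrier_dsb_st();	/* completion barrier, stores only */
		break;
	case USE_DMB_ST:
		barrier_dmb_st();	/* ordering barrier, full system */
		break;
	case USE_DMB_OSHST:
		barrier_dmb_oshst();	/* ordering barrier, outer shareable */
		break;
	}
}

/* Write the payload, order it, then write the (stand-in) control word. */
static void publish(uint64_t *buff, uint64_t data,
		    volatile uint64_t *ctrl, uint64_t cmd,
		    enum barrier_kind kind)
{
	assert(buff && ctrl);

	*buff = data;
	store_barrier(kind);
	*ctrl = cmd;
}
```

All three variants leave the same values in memory; what differs is how long
the CPU may have to wait at the barrier and which observers the ordering is
guaranteed for.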

> IIRC, the logic is that if an observer in the same shareability domain
> is seeing the write64 (3), it should have observed the memcpy (1) as
> well given the DMB. The device in question is one of the observers
> observing the memcpy to 'buff' (but it doesn't 'observe' the write64
> itself). In a multi-copy atomic world, if a third observer is seeing the
> write64 and therefore the memcpy, it means that the device should have
> observed the memcpy as well (the multi-copy atomicity requirement).
> 
> That's where it looks a bit like Schrodinger's cat to me (the state of
> the cat being whether the device observed the memcpy or not). You can't
> be sure until you have a third observer seeing the write64 to device. In
> the absence of such hypothetical observer, the device might or might not
> have seen the new data in 'buff' since it cannot observe the write64 to
> its control register (and from the commit log, this seems to be the case
> with peripherals private to a CPU).

Yes, CPU-private peripherals may well need additional ordering, but they
likely also roll their own I/O accessors.

> I guess the question is what does it mean for the device that a third
> observer saw the write64. In one interpretation of observability,
> another write64 from the third observer is ordered after the original
> write64 but to me it still doesn't help clarify any order imposed on the
> device access to 'buff':
> 
> Initial state:
>   buff=0
>   ctrl=0
> 
> P0:		P1:		Device:
>   Wbuff=1	  Wctrl=2	  Ry=buff
>   DMB		  DMB
>   Wctrl=1	  Rx=buff
> 
> If the final 'ctrl' register value is 2 then x==1. But I don't see how
> y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
> including the device, should see the buff value of 1 but this assumes
> that there is some other ordering for when Ry=buff is issued.

You need to relate the write to 'ctrl' with the device's read of 'buff'
somehow. Under which circumstances does the device read 'buff' (i.e.
what are the register fields in 'ctrl')?

Will


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-08 13:50   ` Will Deacon
@ 2022-09-12  8:43     ` Mark Zhang
  2022-09-13 11:11       ` Will Deacon
  2022-09-12 18:16     ` Catalin Marinas
  1 sibling, 1 reply; 8+ messages in thread
From: Mark Zhang @ 2022-09-12  8:43 UTC
  To: Will Deacon, Catalin Marinas, Suresh Warrier
  Cc: linux-arm-kernel, Yishai Hadas, Jason Gunthorpe, Maor Gottlieb,
	Leon Romanovsky, Michael Guralnik, Michael Berezin, yong.xu,
	Eran Ben Elisha

On 9/8/2022 9:50 PM, Will Deacon wrote:
> On Wed, Sep 07, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
>> On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
>>> May I consult when to use dsb or dmb in our device driver, thanks:
>>>
>>> For example when send a command a FW/HW, usually we do it with 3 steps:
>>>    1. memcpy(buff, src, size);
>>>    2. wmb();
>>>    3. write64(ctrl, reg_addr);
> 
> I'm assuming that write64 is just a plain 64-bit store to a device mapping
> and doesn't imply any further ordering.
> 
>>> IIUC in kernel wmb() is implemented with "dsb st". When we change this to
>>> "dmb st" then we get better performance, but we are not sure if it's safe. I
>>> have read Will's post[1] but still not sure.
>>>
>>> So our questions are:
>>> 1. can we use "dmb" here?
>>> 2. If we can then should we use "dmb st", or "dmb oshst"?
>>>
>>> Thank you very much.
>>>
>>> [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
>>
>> Will convinced me at the time that it's sufficient, though every time I
>> revisit this I get confused ;). Not sure whether we have updated the
>> memory model since to cover such scenarios. In practice at least from
>> what I recall that should be safe.
> 
> The Armv8 memory model is "other-multi-copy-atomic" which means that a
> store is either visible _only_ to the observer from which it originates
> or it is visible to all observers. It cannot exist in some intermediate
> state.
> 
> With that, the insight is that a write to the MMIO interface of a shared
> peripheral must be observed by all observers when it reaches the endpoint.
> Consequently, we only need to ensure that the stores from your memcpy()
> in the motivating example are observed before the MMIO write is observed
> and a DMB ST is sufficient for that. We use OSHST in Linux in case the
> memory buffer is mapped as non-cacheable but I'm doubtful whether it makes
> a real practical difference.
> 

Thank you very much Catalin and Will.

However my colleague Suresh still has some concerns:
"I believe the effect of the write64() here is to trigger a side effect 
in the device (that it is not a true write to memory although it is a 
memory access and so the NIC is not actually reading this memory 
address). If that is the case, a dsb is likely needed to guarantee that 
the effects of the memcpy are also observed by the NIC. You can check 
out some examples in Appendix K11 (Barrier Litmus Tests) of the Arm ARM 
– for instance K11.4 and K11.5.4, where a dsb is used for these kinds of 
scenarios.
... There is a subtle difference between observing the execution of an 
instruction and observing the completion of an instruction"

What do you think?
Thanks.

>> IIRC, the logic is that if an observer in the same shareability domain
>> is seeing the write64 (3), it should have observed the memcpy (1) as
>> well given the DMB. The device in question is one of the observers
>> observing the memcpy to 'buff' (but it doesn't 'observe' the write64
>> itself). In a multi-copy atomic world, if a third observer is seeing the
>> write64 and therefore the memcpy, it means that the device should have
>> observed the memcpy as well (the multi-copy atomicity requirement).
>>
>> That's where it looks a bit like Schrodinger's cat to me (the state of
>> the cat being whether the device observed the memcpy or not). You can't
>> be sure until you have a third observer seeing the write64 to device. In
>> the absence of such hypothetical observer, the device might or might not
>> have seen the new data in 'buff' since it cannot observe the write64 to
>> its control register (and from the commit log, this seems to be the case
>> with peripherals private to a CPU).
> 
> Yes, CPU-private peripherals may well need additional ordering, but they
> likely also roll their own I/O accessors.
> 
>> I guess the question is what does it mean for the device that a third
>> observer saw the write64. In one interpretation of observability,
>> another write64 from the third observer is ordered after the original
>> write64 but to me it still doesn't help clarify any order imposed on the
>> device access to 'buff':
>>
>> Initial state:
>>    buff=0
>>    ctrl=0
>>
>> P0:           P1:             Device:
>>    Wbuff=1       Wctrl=2         Ry=buff
>>    DMB           DMB
>>    Wctrl=1       Rx=buff
>>
>> If the final 'ctrl' register value is 2 then x==1. But I don't see how
>> y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
>> including the device, should see the buff value of 1 but this assumes
>> that there is some other ordering for when Ry=buff is issued.
> 
> You need to relate the write to 'ctrl' with the device's read of 'buff'
> somehow. Under which circumstances does the device read 'buff' (i.e.
> what are the register fields in 'ctrl')?
> 
> Will



* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-08 13:50   ` Will Deacon
  2022-09-12  8:43     ` Mark Zhang
@ 2022-09-12 18:16     ` Catalin Marinas
  2022-09-13 11:06       ` Will Deacon
  1 sibling, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2022-09-12 18:16 UTC
  To: Will Deacon
  Cc: Mark Zhang, linux-arm-kernel, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Leon Romanovsky, Michael Guralnik,
	Michael Berezin, yong.xu, Eran Ben Elisha

On Thu, Sep 08, 2022 at 02:50:17PM +0100, Will Deacon wrote:
> On Wed, Sep 07, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> > On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
> > > May I consult when to use dsb or dmb in our device driver, thanks:
> > > 
> > > For example when send a command a FW/HW, usually we do it with 3 steps:
> > >   1. memcpy(buff, src, size);
> > >   2. wmb();
> > >   3. write64(ctrl, reg_addr);
> 
> I'm assuming that write64 is just a plain 64-bit store to a device mapping
> and doesn't imply any further ordering.

That was my assumption as well, an STR to device memory (if it's an MSR,
we do need a DSB).

> > > IIUC in kernel wmb() is implemented with "dsb st". When we change this to
> > > "dmb st" then we get better performance, but we are not sure if it's safe. I
> > > have read Will's post[1] but still not sure.
> > > 
> > > So our questions are:
> > > 1. can we use "dmb" here?
> > > 2. If we can then should we use "dmb st", or "dmb oshst"?
> > > 
> > > Thank you very much.
> > > 
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > 
> > Will convinced me at the time that it's sufficient, though every time I
> > revisit this I get confused ;). Not sure whether we have updated the
> > memory model since to cover such scenarios. In practice at least from
> > what I recall that should be safe.
> 
> The Armv8 memory model is "other-multi-copy-atomic" which means that a
> store is either visible _only_ to the observer from which it originates
> or it is visible to all observers. It cannot exist in some intermediate
> state.
> 
> With that, the insight is that a write to the MMIO interface of a shared
> peripheral must be observed by all observers when it reaches the endpoint.

What's the endpoint here? The device itself or some serialisation point
on the path to the device? IIUC, this can be a serialisation point in
certain circumstances (e.g. with early write acknowledgement).

> Consequently, we only need to ensure that the stores from your memcpy()
> in the motivating example are observed before the MMIO write is observed
> and a DMB ST is sufficient for that.

Yes but this is all about other observers observing the MMIO write
rather than the device itself which cannot observe the MMIO write, so
the CPU doesn't need to impose any order between these two.

Let's say we have a topology with two ports, one for MMIO and the other
for RAM accesses, each with its own serialisation point:

  +-------+    +-------+
  | CPU 0 |    | CPU 1 |
  +-------+    +-------+
    |   |        |   |
   (a)--|--------+   |                  (a) MMIO serialisation point
    |   +-----------(b)---+             (b) RAM serialisation point
    |                |    |
  +-----+        +-----+  |
  | Dev |        | RAM |  |
  +-----+        +-----+  |
     |                    |
     +-----DMA------------+

All accesses to RAM, including the device DMA, go through serialisation
point (b). The MMIO accesses go through point (a). I don't know how
realistic this is in practice (well, it can be a lot more complex) but
with a few rules the above topology can obey the memory model. The
simplest is for a DMB to cause the CPU to wait for the acknowledgement
that a transaction reached a serialisation point before issuing new ones
but there can be other ways like accesses issued on both ports before
reaching the corresponding serialisation points. The serialisation
points could communicate between them to ensure ordering in the presence
of a third observer.

My worry is that in the absence of CPU1 (or transactions from CPU1), the
hardware may decide to forward an MMIO access to the device even if it
is ordered after a RAM transaction since it doesn't break any
observability rules (it might as well consider the device private).
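
The worry can be illustrated with a toy timing model of the two-port
topology. Everything here is an assumption made for the sketch: `struct
fabric`, the fixed per-port latencies, the zero-latency DMA read and the
`wait_for_ack` flag (which models the "simplest" rule mentioned above, where
the CPU waits for the RAM write to reach (b) before issuing the MMIO write).
Real interconnects are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port latencies, in arbitrary cycles. */
struct fabric {
	int ram_latency;	/* CPU write -> serialisation point (b) */
	int mmio_latency;	/* CPU write -> device, via point (a) */
};

/*
 * Returns the value of 'buff' the device fetches (0 = stale, 1 = new).
 * The device is assumed to DMA-read 'buff' through (b) in the same cycle
 * the doorbell write arrives.
 */
static int device_reads_buff(const struct fabric *f, bool wait_for_ack)
{
	assert(f->ram_latency >= 0 && f->mmio_latency >= 0);

	int buff_issued = 0;
	int buff_at_b = buff_issued + f->ram_latency;
	/* Without the wait, the doorbell leaves one cycle after the store. */
	int ctrl_issued = wait_for_ack ? buff_at_b : buff_issued + 1;
	int ctrl_at_dev = ctrl_issued + f->mmio_latency;

	return ctrl_at_dev >= buff_at_b ? 1 : 0;
}
```

With a congested RAM port (say ram_latency 10, mmio_latency 2) the device
fetches stale data unless the CPU waits for the acknowledgement; with an idle
RAM port the ordering happens to work out either way.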

> > I guess the question is what does it mean for the device that a third
> > observer saw the write64. In one interpretation of observability,
> > another write64 from the third observer is ordered after the original
> > write64 but to me it still doesn't help clarify any order imposed on the
> > device access to 'buff':
> > 
> > Initial state:
> >   buff=0
> >   ctrl=0
> > 
> > P0:		P1:		Device:
> >   Wbuff=1	  Wctrl=2	  Ry=buff
> >   DMB		  DMB
> >   Wctrl=1	  Rx=buff
> > 
> > If the final 'ctrl' register value is 2 then x==1. But I don't see how
> > y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
> > including the device, should see the buff value of 1 but this assumes
> > that there is some other ordering for when Ry=buff is issued.
> 
> You need to relate the write to 'ctrl' with the device's read of 'buff'
> somehow. Under which circumstances does the device read 'buff' (i.e.
> what are the register fields in 'ctrl')?

I don't think we have anything in the memory model that can relate the
write to MMIO with the device read from memory (DMA) since the device
doesn't do a 'master' access to its own registers (i.e. go through
serialisation point (a)). That's where I fail to explain in terms of the
memory model why a DMB is sufficient (but I'm far from an expert here).

The scenario I have in mind is that P0 might forward the Wctrl=1 before
Wbuff=1 reaches serialisation point (b) (e.g. there is some congestion
on that port). If Wctrl=2 on P1 arrives at (a) after Wctrl=1,
serialisation point (a) could stall it until point (b) confirms that all
transactions prior to DMB have been sent so that the P0/P1 ordering is
respected. However, this has no effect on the device observing Wbuff=1.

I think the other way around holds - if the device observes Wbuff=1,
P1 must observe it as well. But if the device doesn't observe Wbuff=1,
nothing breaks AFAICT.

I think we need a whiteboard (or a table in a pub after Plumbers).

-- 
Catalin


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-12 18:16     ` Catalin Marinas
@ 2022-09-13 11:06       ` Will Deacon
  2022-09-14 17:29         ` Catalin Marinas
  0 siblings, 1 reply; 8+ messages in thread
From: Will Deacon @ 2022-09-13 11:06 UTC
  To: Catalin Marinas
  Cc: Mark Zhang, linux-arm-kernel, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Leon Romanovsky, Michael Guralnik,
	Michael Berezin, yong.xu, Eran Ben Elisha

On Mon, Sep 12, 2022 at 07:16:26PM +0100, Catalin Marinas wrote:
> On Thu, Sep 08, 2022 at 02:50:17PM +0100, Will Deacon wrote:
> > On Wed, Sep 07, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> > > On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
> > > > So our questions are:
> > > > 1. can we use "dmb" here?
> > > > 2. If we can then should we use "dmb st", or "dmb oshst"?
> > > > 
> > > > Thank you very much.
> > > > 
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > 
> > > Will convinced me at the time that it's sufficient, though every time I
> > > revisit this I get confused ;). Not sure whether we have updated the
> > > memory model since to cover such scenarios. In practice at least from
> > > what I recall that should be safe.
> > 
> > The Armv8 memory model is "other-multi-copy-atomic" which means that a
> > store is either visible _only_ to the observer from which it originates
> > or it is visible to all observers. It cannot exist in some intermediate
> > state.
> > 
> > With that, the insight is that a write to the MMIO interface of a shared
> > peripheral must be observed by all observers when it reaches the endpoint.
> 
> What's the endpoint here? The device itself or some serialisation point
> on the path to the device? IIUC, this can be a serialisation point in
> certain circumstances (e.g. with early write acknowledgement).

I don't think it matters too much, but you could replace "reaches the
endpoint" with "when the endpoint device begins to change state" if it
helps.

> > Consequently, we only need to ensure that the stores from your memcpy()
> > in the motivating example are observed before the MMIO write is observed
> > and a DMB ST is sufficient for that.
> 
> Yes but this is all about other observers observing the MMIO write
> rather than the device itself which cannot observe the MMIO write, so
> the CPU doesn't need to impose any order between these two.

Sorry, I don't understand you here. I agree that the device itself does not
observe the MMIO write, but the read transaction which it will generate in
response to the MMIO write is what we're trying to order against, and we need
to impose order between the memory writes and the MMIO write for that.

> Let's say we have a topology with two ports, one for MMIO and the other
> for RAM accesses, each with its own serialisation point:
> 
>   +-------+    +-------+
>   | CPU 0 |    | CPU 1 |
>   +-------+    +-------+
>     |   |        |   |
>    (a)--|--------+   |                  (a) MMIO serialisation point
>     |   +-----------(b)---+             (b) RAM serialisation point
>     |                |    |
>   +-----+        +-----+  |
>   | Dev |        | RAM |  |
>   +-----+        +-----+  |
>      |                    |
>      +-----DMA------------+
> 
> All accesses to RAM, including the device DMA, go through serialisation
> point (b). The MMIO accesses go through point (a). I don't know how
> realistic this is in practice (well, it can be a lot more complex) but
> with a few rules the above topology can obey the memory model. The
> simplest is for a DMB to cause the CPU to wait for the acknowledgement
> that a transaction reached a serialisation point before issuing new ones
> but there can be other ways like accesses issued on both ports before
> reaching the corresponding serialisation points. The serialisation
> points could communicate between them to ensure ordering in the presence
> of a third observer.
> 
> My worry is that in the absence of CPU1 (or transactions from CPU1), the
> hardware may decide to forward an MMIO access to the device even if it
> is ordered after a RAM transaction since it doesn't break any
> observability rules (it might as well consider the device private).

Right, I did distinguish between private and shared peripherals in my
initial response. A DSB may be required for private peripherals, but in
practice this is not the case and any such devices tend to have special
(arch-specific) drivers which can include the DSB (e.g. the GIC).

> > > I guess the question is what does it mean for the device that a third
> > > observer saw the write64. In one interpretation of observability,
> > > another write64 from the third observer is ordered after the original
> > > write64 but to me it still doesn't help clarify any order imposed on the
> > > device access to 'buff':
> > > 
> > > Initial state:
> > >   buff=0
> > >   ctrl=0
> > > 
> > > P0:		P1:		Device:
> > >   Wbuff=1	  Wctrl=2	  Ry=buff
> > >   DMB		  DMB
> > >   Wctrl=1	  Rx=buff
> > > 
> > > If the final 'ctrl' register value is 2 then x==1. But I don't see how
> > > y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
> > > including the device, should see the buff value of 1 but this assumes
> > > that there is some other ordering for when Ry=buff is issued.
> > 
> > You need to relate the write to 'ctrl' with the device's read of 'buff'
> > somehow. Under which circumstances does the device read 'buff' (i.e.
> > what are the register fields in 'ctrl')?
> 
> I don't think we have anything in the memory model that can relate the
> write to MMIO with the device read from memory (DMA) since the device
> doesn't do a 'master' access to its own registers (i.e. go through
> serialisation point (a)). That's where I fail to explain in terms of the
> memory model why a DMB is sufficient (but I'm far from an expert here).

The architecture does actually have "out-of-band-ordered-before" for this
case, but it is built around DSB because it has to deal with the general
case which includes private peripherals.

> The scenario I have in mind is that P0 might forward the Wctrl=1 before
> Wbuff=1 reaches serialisation point (b) (e.g. there is some congestion
> on that port). If Wctrl=2 on P1 arrives at (a) after Wctrl=1,
> serialisation point (a) could stall it until point (b) confirms that all
> transactions prior to DMB have been sent so that the P0/P1 ordering is
> respected. However, this has no effect on the device observing Wbuff=1.
>
> I think the other way around holds - if the device observes Wbuff=1, the
> P1 must observe it as well. But if the device doesn't observe Wbuff=1,
> nothing breaks AFAICT.
> 
> I think we need a whiteboard (or a table in a pub after Plumbers).

Happy to. I can't tell from your diagram and text whether CPU1 can read
writes from CPU0 before they've reached (a) or (b).

Will


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-12  8:43     ` Mark Zhang
@ 2022-09-13 11:11       ` Will Deacon
  0 siblings, 0 replies; 8+ messages in thread
From: Will Deacon @ 2022-09-13 11:11 UTC
  To: Mark Zhang
  Cc: Catalin Marinas, Suresh Warrier, linux-arm-kernel, Yishai Hadas,
	Jason Gunthorpe, Maor Gottlieb, Leon Romanovsky,
	Michael Guralnik, Michael Berezin, yong.xu, Eran Ben Elisha

On Mon, Sep 12, 2022 at 04:43:14PM +0800, Mark Zhang wrote:
> However my colleague Suresh still has some concerns:
> "I believe the effect of the write64() here is to trigger a side effect in
> the device (that it is not a true write to memory although it is a memory
> access and so the NIC is not actually reading this memory address). If that
> is the case, a dsb is likely needed to guarantee that the effects of the
> memcpy are also observed by the NIC. You can check out some examples in
> Appendix K11 (Barrier Litmus Tests) of the Arm ARM – for instance K11.4 and
> K11.5.4, where a dsb is used for these kinds of scenarios.
> ... There is a subtle difference between observing the execution of an
> instruction and observing the completion of an instruction"

A DSB is required in the general case to deal with private peripherals,
but these are pretty rare in practice and generally MMIO regions are going
to be shared between CPUs.

> What do you think?

I think a DMB is sufficient to ensure ordering (and so a DSB will work too,
but with much greater performance impact). If you actually want to cause the
device to change state before issuing the writes to memory, then even a DSB
is not enough -- you likely also want something like a read-back from a
status register and that sequence is device specific.
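
As a rough illustration of the read-back idea, the sequence could look like
the sketch below. The device here is entirely made up: `struct fake_dev`,
`STATUS_DONE`, the `fake_*()` accessors and `command_then_store()` are
hypothetical, and the "device" simply reports completion after a fixed number
of status reads. A real driver would use its own registers and accessors, and
the exact sequence is device specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STATUS_DONE 0x1u

/* Hypothetical device model, not a real register layout. */
struct fake_dev {
	uint64_t ctrl;
	uint64_t status;
	int polls_until_done;	/* status reads before the command completes */
};

static void fake_write_ctrl(struct fake_dev *dev, uint64_t cmd)
{
	dev->ctrl = cmd;
	dev->status = 0;
}

/* Each read lets the fake device make progress on the command. */
static uint64_t fake_read_status(struct fake_dev *dev)
{
	if (dev->polls_until_done > 0)
		dev->polls_until_done--;
	else
		dev->status |= STATUS_DONE;
	return dev->status;
}

/*
 * Issue the command, read the status register back until the device
 * reports the state change, and only then perform the memory write.
 * Returns false (leaving *mem untouched) if the device never confirms.
 */
static bool command_then_store(struct fake_dev *dev, uint64_t cmd,
			       uint64_t *mem, uint64_t val, int max_polls)
{
	assert(dev && mem && max_polls > 0);

	fake_write_ctrl(dev, cmd);
	for (int i = 0; i < max_polls; i++) {
		if (fake_read_status(dev) & STATUS_DONE) {
			*mem = val;
			return true;
		}
	}
	return false;
}
```

The point of the sketch is only the shape of the sequence: the memory write
is gated on something the device itself reports, rather than on a barrier
executed by the CPU.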

Will


* Re: Should we use "dsb" or "dmb" between write to buffer and write to register
  2022-09-13 11:06       ` Will Deacon
@ 2022-09-14 17:29         ` Catalin Marinas
  0 siblings, 0 replies; 8+ messages in thread
From: Catalin Marinas @ 2022-09-14 17:29 UTC
  To: Will Deacon
  Cc: Mark Zhang, linux-arm-kernel, Yishai Hadas, Jason Gunthorpe,
	Maor Gottlieb, Leon Romanovsky, Michael Guralnik,
	Michael Berezin, yong.xu, Eran Ben Elisha

On Tue, Sep 13, 2022 at 12:06:20PM +0100, Will Deacon wrote:
> On Mon, Sep 12, 2022 at 07:16:26PM +0100, Catalin Marinas wrote:
> > On Thu, Sep 08, 2022 at 02:50:17PM +0100, Will Deacon wrote:
> > > On Wed, Sep 07, 2022 at 06:53:43PM +0100, Catalin Marinas wrote:
> > > > On Mon, Aug 22, 2022 at 03:53:42PM +0800, Mark Zhang wrote:
> > > > > So our questions are:
> > > > > 1. can we use "dmb" here?
> > > > > 2. If we can then should we use "dmb st", or "dmb oshst"?
> > > > > 
> > > > > Thank you very much.
> > > > > 
> > > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=22ec71615d824f4f11d38d0e55a88d8956b7e45f
> > > > 
> > > > Will convinced me at the time that it's sufficient, though every time I
> > > > revisit this I get confused ;). Not sure whether we have updated the
> > > > memory model since to cover such scenarios. In practice at least from
> > > > what I recall that should be safe.
> > > 
> > > The Armv8 memory model is "other-multi-copy-atomic" which means that a
> > > store is either visible _only_ to the observer from which it originates
> > > or it is visible to all observers. It cannot exist in some intermediate
> > > state.
> > > 
> > > With that, the insight is that a write to the MMIO interface of a shared
> > > peripheral must be observed by all observers when it reaches the endpoint.
> > 
> > What's the endpoint here? The device itself or some serialisation point
> > on the path to the device? IIUC, this can be a serialisation point in
> > certain circumstances (e.g. with early write acknowledgement).
> 
> I don't think it matters too much, but you could replace "reaches the
> endpoint" with "when the endpoint device begins to change state" if it
> helps.
> 
> > > Consequently, we only need to ensure that the stores from your memcpy()
> > > in the motivating example are observed before the MMIO write is observed
> > > and a DMB ST is sufficient for that.
> > 
> > Yes but this is all about other observers observing the MMIO write
> > rather than the device itself which cannot observe the MMIO write, so
> > the CPU doesn't need to impose any order between these two.
> 
> Sorry, I don't understand you here. I agree that the device itself does not
> observe the MMIO write, but the read transaction which it will generate in
> response to the MMIO write is what we're trying to order against, and we need
> to impose order between the memory writes and the MMIO write for that.

Yes, but the question is whether DMB is sufficient for such order or the
CPU/interconnect is free to reorder the MMIO write to the device and fix
things up later if there is a third observer of the MMIO write.

> > Let's say we have a topology with two ports, one for MMIO and the other
> > for RAM accesses, each with its own serialisation point:
> > 
> >   +-------+    +-------+
> >   | CPU 0 |    | CPU 1 |
> >   +-------+    +-------+
> >     |   |        |   |
> >    (a)--|--------+   |                  (a) MMIO serialisation point
> >     |   +-----------(b)---+             (b) RAM serialisation point
> >     |                |    |
> >   +-----+        +-----+  |
> >   | Dev |        | RAM |  |
> >   +-----+        +-----+  |
> >      |                    |
> >      +-----DMA------------+
> > 
> > All accesses to RAM, including the device DMA, go through serialisation
> > point (b). The MMIO accesses go through point (a). I don't know how
> > realistic this is in practice (well, it can be a lot more complex) but
> > with a few rules the above topology can obey the memory model. The
> > simplest is for a DMB to cause the CPU to wait for the acknowledgement
> > that a transaction reached a serialisation point before issuing new ones
> > but there can be other ways like accesses issued on both ports before
> > reaching the corresponding serialisation points. The serialisation
> > points could communicate between them to ensure ordering in the presence
> > of a third observer.
> > 
> > My worry is that in the absence of CPU1 (or transactions from CPU1), the
> > hardware may decide to forward an MMIO access to the device even if it
> > is ordered after a RAM transaction since it doesn't break any
> > observability rules (it might as well consider the device private).
> 
> Right, I did distinguish between private and shared peripherals in my
> initial response. A DSB may be required for private peripherals, but in
> practice this is not the case and any such devices tend to have special
> (arch-specific) drivers which can include the DSB (e.g. the GIC).

My point is that the interconnect may try to be "smarter" and consider
the device private in the absence of transactions from a third observer
(CPU1).

> > > > I guess the question is what does it mean for the device that a third
> > > > observer saw the write64. In one interpretation of observability,
> > > > another write64 from the third observer is ordered after the original
> > > > write64 but to me it still doesn't help clarify any order imposed on the
> > > > device access to 'buff':
> > > > 
> > > > Initial state:
> > > >   buff=0
> > > >   ctrl=0
> > > > 
> > > > P0:		P1:		Device:
> > > >   Wbuff=1	  Wctrl=2	  Ry=buff
> > > >   DMB		  DMB
> > > >   Wctrl=1	  Rx=buff
> > > > 
> > > > If the final 'ctrl' register value is 2 then x==1. But I don't see how
> > > > y==0 or 1 is influenced by Wctrl=2. If x==1 on P1, any other observer,
> > > > including the device, should see the buff value of 1 but this assumes
> > > > that there is some other ordering for when Ry=buff is issued.
> > > 
> > > You need to relate the write to 'ctrl' with the device's read of 'buff'
> > > somehow. Under which circumstances does the device read 'buff' (i.e.
> > > what are the register fields in 'ctrl')?
> > 
> > I don't think we have anything in the memory model that can relate the
> > write to MMIO with the device read from memory (DMA) since the device
> > doesn't do a 'master' access to its own registers (i.e. go through
> > serialisation point (a)). That's where I fail to explain in terms of the
> > memory model why a DMB is sufficient (but I'm far from an expert here).
> 
> The architecture does actually have "out-of-band-ordered-before" for this
> case, but it is built around DSB because it has to deal with the general
> case which includes private peripherals.

IIUC, RW1 in the first bullet point would be the memory write while the
RW2 is the device DMA access to memory. The "implementation defined
instruction sequence" would be the MMIO write. So yes, a DSB would
guarantee that but that may be too strong.

That said, I find the note after the out-of-band-ordered-before
definition more interesting as it talks about early acknowledgement and
global visibility. As per your explanation, if a DMB ensures that the
memory write is globally observable before the MMIO write, the device
must observe the memory write.
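
To make that argument concrete, here is a toy enumeration of the
buff/ctrl litmus test, assuming a single global order in which every
observer, the device included, sees the writes. That assumption is
exactly the one under discussion in this thread, so this is a sketch of
the reasoning and not a statement about any implementation. The function
name stale_read_possible() and the encoding of the interleavings are
invented for the example.

```c
#include <stdbool.h>

/*
 * P0 performs two writes, buff=1 and ctrl=1. The device reads ctrl and
 * then reads buff. With the barrier, P0's writes become visible in
 * program order; without it, either order is allowed.
 *
 * Returns true if some interleaving lets the device see ctrl==1 while
 * still reading buff==0.
 */
static bool stale_read_possible(bool barrier)
{
	int norders = barrier ? 1 : 2;	/* 0: buff,ctrl   1: ctrl,buff */

	for (int order = 0; order < norders; order++) {
		/* Each set bit in 'mask' gives that time slot to P0. */
		for (unsigned mask = 0; mask < 16; mask++) {
			int bits = 0;
			for (int s = 0; s < 4; s++)
				bits += (mask >> s) & 1;
			if (bits != 2)
				continue;	/* P0 has exactly two operations */

			int buff = 0, ctrl = 0;
			int p0_step = 0, dev_step = 0;
			int seen_ctrl = 0, seen_buff = 0;

			for (int s = 0; s < 4; s++) {
				if ((mask >> s) & 1) {
					/* P0: pick the write for this step */
					bool write_buff = (order == 0) == (p0_step == 0);
					if (write_buff)
						buff = 1;
					else
						ctrl = 1;
					p0_step++;
				} else {
					/* Device: read ctrl first, then buff */
					if (dev_step == 0)
						seen_ctrl = ctrl;
					else
						seen_buff = buff;
					dev_step++;
				}
			}
			if (seen_ctrl == 1 && seen_buff == 0)
				return true;
		}
	}
	return false;
}
```

In this model the barrier rules out the stale read, and dropping it lets
the ctrl write overtake the buff write so that the device can read the
old buffer contents.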

A further note in the ARM ARM would be nice, maybe something about the
coherence order for memory-mapped shared peripherals. So far the
definition of the memory-mapped peripherals is even more confusing:

  Memory effects to a Memory-mapped peripheral might not appear in the
  Reads-from or Coherence order relations.

My reading of the above is that this breaks the assumption we rely on to
be able to use a DMB, namely that the MMIO write is globally observable.

> > The scenario I have in mind is that P0 might forward the Wctrl=1 before
> > Wbuff=1 reaches serialisation point (b) (e.g. there is some congestion
> > on that port). If Wctrl=2 on P1 arrives at (a) after Wctrl=1,
> > serialisation point (a) could stall it until point (b) confirms that all
> > transactions prior to DMB have been sent so that the P0/P1 ordering is
> > respected. However, this has no effect on the device observing Wbuff=1.
> >
> > I think the other way around holds - if the device observes Wbuff=1, the
> > P1 must observe it as well. But if the device doesn't observe Wbuff=1,
> > nothing breaks AFAICT.
> > 
> > I think we need a whiteboard (or a table in a pub after Plumbers).
> 
> Happy to. I can't tell from your diagram and text whether CPU1 can read
> writes from CPU0 before they've reached (a) or (b).

No, it can't, and neither can the device. The case I have in mind is that
the CPU and interconnect may somehow decide to forward the MMIO write
through (a) before the memory write reaches (b).

-- 
Catalin


end of thread, other threads:[~2022-09-14 17:30 UTC | newest]

Thread overview: 8+ messages
2022-08-22  7:53 Should we use "dsb" or "dmb" between write to buffer and write to register Mark Zhang
2022-09-07 17:53 ` Catalin Marinas
2022-09-08 13:50   ` Will Deacon
2022-09-12  8:43     ` Mark Zhang
2022-09-13 11:11       ` Will Deacon
2022-09-12 18:16     ` Catalin Marinas
2022-09-13 11:06       ` Will Deacon
2022-09-14 17:29         ` Catalin Marinas
