* SMP barriers semantics
@ 2010-03-02 10:52 Catalin Marinas
  2010-03-03  0:55 ` Paul Mackerras
  0 siblings, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2010-03-02 10:52 UTC (permalink / raw)
  To: linux-arch; +Cc: Russell King

Hi,

We have an issue with the barriers usage/implementation on ARM and I
would like some clarification.

As a background - latest ARM processors have two kinds of barriers - a
lightweight one (DMB) which basically only ensures the ordering of
accesses to the same memory type (the definition is a bit more
complicated but in the context of Linux this is a safe simplification).
The second kind of barrier is a heavyweight one (DSB) which drains the
write buffers.

Both *mb() and smp_*mb() are currently implemented with the lightweight
version (DMB) but this is not enough for coherent DMA operations where a
DSB is needed to drain the write buffer before writing to the device I/O
memory for starting the transfer. My proposal on the ARM lists was to
change mb()/wmb() to DSB but leave the smp_*mb() as a DMB.

The main question - are the Linux SMP barriers supposed to have an
effect outside of cacheable memory accesses (i.e. ordering wrt I/O
accesses)?

My understanding from other comments in the kernel source is that the
SMP barriers are only meant for cacheable memory but there are drivers
that do something like below (e.g. drivers/net/r8169.c):

		/* We need to force the visibility of tp->intr_mask
		 * for other CPUs, as we can lose an MSI interrupt
		 * and potentially wait for a retransmit timeout if we don't.
		 * The posted write to IntrMask is safe, as it will
		 * eventually make it to the chip and we won't lose anything
		 * until it does.
		 */
		tp->intr_mask = 0xffff;
		smp_wmb();
		RTL_W16(IntrMask, tp->intr_event);

Is this supposed to work given the SMP barriers semantics?

Thanks.

-- 
Catalin

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: SMP barriers semantics
  2010-03-02 10:52 SMP barriers semantics Catalin Marinas
@ 2010-03-03  0:55 ` Paul Mackerras
  2010-03-03 12:03   ` Catalin Marinas
  2010-03-04  2:23   ` David Dillow
  0 siblings, 2 replies; 22+ messages in thread
From: Paul Mackerras @ 2010-03-03  0:55 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: linux-arch, Russell King

On Tue, Mar 02, 2010 at 10:52:58AM +0000, Catalin Marinas wrote:

> We have an issue with the barriers usage/implementation on ARM and I
> would like some clarification.
> 
> As a background - latest ARM processors have two kinds of barriers - a
> lightweight one (DMB) which basically only ensures the ordering of
> accesses to the same memory type (the definition is a bit more
> complicated but in the context of Linux this is a safe simplification).
> The second kind of barrier is a heavyweight one (DSB) which drains the
> write buffers.
> 
> Both *mb() and smp_*mb() are currently implemented with the lightweight
> version (DMB) but this is not enough for coherent DMA operations where a
> DSB is needed to drain the write buffer before writing to the device I/O
> memory for starting the transfer. My proposal on the ARM lists was to
> change mb()/wmb() to DSB but leave the smp_*mb() as a DMB.
> 
> The main question - are the Linux SMP barriers supposed to have an
> effect outside of cacheable memory accesses (i.e. ordering wrt I/O
> accesses)?

The SMP barriers are only required to order cacheable accesses.  The
plain (non-SMP) barriers (mb, wmb, rmb) are required to order both
cacheable and non-cacheable accesses.

> My understanding from other comments in the kernel source is that the
> SMP barriers are only meant for cacheable memory but there are drivers
> that do something like below (e.g. drivers/net/r8169.c):
> 
> 		/* We need to force the visibility of tp->intr_mask
> 		 * for other CPUs, as we can lose an MSI interrupt
> 		 * and potentially wait for a retransmit timeout if we don't.
> 		 * The posted write to IntrMask is safe, as it will
> 		 * eventually make it to the chip and we won't lose anything
> 		 * until it does.
> 		 */
> 		tp->intr_mask = 0xffff;
> 		smp_wmb();
> 		RTL_W16(IntrMask, tp->intr_event);
> 
> Is this supposed to work given the SMP barriers semantics?

Well, if the smp_wmb() is supposed to make the assignment to
tp->intr_mask globally visible before any effects of the RTL_W16(),
then it's buggy.  But from the comments it appears that the smp_wmb()
might be intended to order the store to tp->intr_mask with respect to
following cacheable stores, rather than with respect to the RTL_W16(),
which would be OK.  I can't say without having a much closer look at
what that driver is actually doing.

Paul.


* Re: SMP barriers semantics
  2010-03-03  0:55 ` Paul Mackerras
@ 2010-03-03 12:03   ` Catalin Marinas
  2010-03-12 12:31     ` Ralf Baechle
  2010-03-04  2:23   ` David Dillow
  1 sibling, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2010-03-03 12:03 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linux-arch, Russell King, Francois Romieu

On Wed, 2010-03-03 at 00:55 +0000, Paul Mackerras wrote:
> On Tue, Mar 02, 2010 at 10:52:58AM +0000, Catalin Marinas wrote:
> > The main question - are the Linux SMP barriers supposed to have an
> > effect outside of cacheable memory accesses (i.e. ordering wrt I/O
> > accesses)?
> 
> The SMP barriers are only required to order cacheable accesses.  The
> plain (non-SMP) barriers (mb, wmb, rmb) are required to order both
> cacheable and non-cacheable accesses.

Thanks for the clarification.

> > My understanding from other comments in the kernel source is that the
> > SMP barriers are only meant for cacheable memory but there are drivers
> > that do something like below (e.g. drivers/net/r8169.c):
> >
> >               /* We need to force the visibility of tp->intr_mask
> >                * for other CPUs, as we can lose an MSI interrupt
> >                * and potentially wait for a retransmit timeout if we don't.
> >                * The posted write to IntrMask is safe, as it will
> >                * eventually make it to the chip and we won't lose anything
> >                * until it does.
> >                */
> >               tp->intr_mask = 0xffff;
> >               smp_wmb();
> >               RTL_W16(IntrMask, tp->intr_event);
> >
> > Is this supposed to work given the SMP barriers semantics?
> 
> Well, if the smp_wmb() is supposed to make the assignment to
> tp->intr_mask globally visible before any effects of the RTL_W16(),
> then it's buggy.  But from the comments it appears that the smp_wmb()
> might be intended to order the store to tp->intr_mask with respect to
> following cacheable stores, rather than with respect to the RTL_W16(),
> which would be OK.  I can't say without having a much closer look at
> what that driver is actually doing.

I cc'ed the r8169.c maintainer.

But from the architectural support perspective, we don't need to support
more than a lightweight barrier in this case.

Thanks.

-- 
Catalin


* Re: SMP barriers semantics
  2010-03-03  0:55 ` Paul Mackerras
  2010-03-03 12:03   ` Catalin Marinas
@ 2010-03-04  2:23   ` David Dillow
  2010-03-04  9:33     ` Russell King
  1 sibling, 1 reply; 22+ messages in thread
From: David Dillow @ 2010-03-04  2:23 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Catalin Marinas, linux-arch, Russell King, Francois Romieu

On Wed, 2010-03-03 at 11:55 +1100, Paul Mackerras wrote:
> On Tue, Mar 02, 2010 at 10:52:58AM +0000, Catalin Marinas wrote:
> > The main question - are the Linux SMP barriers supposed to have an
> > effect outside of cacheable memory accesses (i.e. ordering wrt I/O
> > accesses)?
> 
> The SMP barriers are only required to order cacheable accesses.  The
> plain (non-SMP) barriers (mb, wmb, rmb) are required to order both
> cacheable and non-cacheable accesses.
> 
> > My understanding from other comments in the kernel source is that the
> > SMP barriers are only meant for cacheable memory but there are drivers
> > that do something like below (e.g. drivers/net/r8169.c):
> > 
> > 		/* We need to force the visibility of tp->intr_mask
> > 		 * for other CPUs, as we can lose an MSI interrupt
> > 		 * and potentially wait for a retransmit timeout if we don't.
> > 		 * The posted write to IntrMask is safe, as it will
> > 		 * eventually make it to the chip and we won't lose anything
> > 		 * until it does.
> > 		 */
> > 		tp->intr_mask = 0xffff;
> > 		smp_wmb();
> > 		RTL_W16(IntrMask, tp->intr_event);
> > 
> > Is this supposed to work given the SMP barriers semantics?
> 
> Well, if the smp_wmb() is supposed to make the assignment to
> tp->intr_mask globally visible before any effects of the RTL_W16(),
> then it's buggy.  But from the comments it appears that the smp_wmb()
> might be intended to order the store to tp->intr_mask with respect to
> following cacheable stores, rather than with respect to the RTL_W16(),
> which would be OK.  I can't say without having a much closer look at
> what that driver is actually doing.

It's buggy. The code was intended to ensure the write to intr_mask was
visible to other CPUs before we told the NIC that it could generate
another interrupt. Given the definition of the barriers above, this
should be wmb() instead of smp_wmb().


* Re: SMP barriers semantics
  2010-03-04  2:23   ` David Dillow
@ 2010-03-04  9:33     ` Russell King
  2010-03-04  9:48       ` Catalin Marinas
  0 siblings, 1 reply; 22+ messages in thread
From: Russell King @ 2010-03-04  9:33 UTC (permalink / raw)
  To: David Dillow; +Cc: Paul Mackerras, Catalin Marinas, linux-arch, Francois Romieu

On Wed, Mar 03, 2010 at 09:23:46PM -0500, David Dillow wrote:
> On Wed, 2010-03-03 at 11:55 +1100, Paul Mackerras wrote:
> > Well, if the smp_wmb() is supposed to make the assignment to
> > tp->intr_mask globally visible before any effects of the RTL_W16(),
> > then it's buggy.  But from the comments it appears that the smp_wmb()
> > might be intended to order the store to tp->intr_mask with respect to
> > following cacheable stores, rather than with respect to the RTL_W16(),
> > which would be OK.  I can't say without having a much closer look at
> > what that driver is actually doing.
> 
> It's buggy. The code was intended to ensure the write to intr_mask was
> visible to other CPUs before we told the NIC that it could generate
> another interrupt. Given the definition of the barriers above, this
> should be wmb() instead of smp_wmb().

There's a whole bunch of other drivers doing exactly the same thing -
just grep drivers/net for smp_wmb(). ;(

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:


* Re: SMP barriers semantics
  2010-03-04  9:33     ` Russell King
@ 2010-03-04  9:48       ` Catalin Marinas
  0 siblings, 0 replies; 22+ messages in thread
From: Catalin Marinas @ 2010-03-04  9:48 UTC (permalink / raw)
  To: Russell King; +Cc: David Dillow, Paul Mackerras, linux-arch, Francois Romieu

On Thu, 2010-03-04 at 09:33 +0000, Russell King wrote:
> On Wed, Mar 03, 2010 at 09:23:46PM -0500, David Dillow wrote:
> > On Wed, 2010-03-03 at 11:55 +1100, Paul Mackerras wrote:
> > > Well, if the smp_wmb() is supposed to make the assignment to
> > > tp->intr_mask globally visible before any effects of the RTL_W16(),
> > > then it's buggy.  But from the comments it appears that the smp_wmb()
> > > might be intended to order the store to tp->intr_mask with respect to
> > > following cacheable stores, rather than with respect to the RTL_W16(),
> > > which would be OK.  I can't say without having a much closer look at
> > > what that driver is actually doing.
> >
> > It's buggy. The code was intended to ensure the write to intr_mask was
> > visible to other CPUs before we told the NIC that it could generate
> > another interrupt. Given the definition of the barriers above, this
> > should be wmb() instead of smp_wmb().
> 
> There's a whole bunch of other drivers doing exactly the same thing -
> just grep drivers/net for smp_wmb(). ;(

Yes, but IMHO we should just fix the drivers rather than penalise the
SMP systems by requiring a heavier barrier.

-- 
Catalin


* Re: SMP barriers semantics
  2010-03-03 12:03   ` Catalin Marinas
@ 2010-03-12 12:31     ` Ralf Baechle
  2010-03-12 20:38       ` Jamie Lokier
  2010-03-17  2:25       ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 22+ messages in thread
From: Ralf Baechle @ 2010-03-12 12:31 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Wed, Mar 03, 2010 at 12:03:45PM +0000, Catalin Marinas wrote:

> > > My understanding from other comments in the kernel source is that the
> > > SMP barriers are only meant for cacheable memory but there are drivers
> > > that do something like below (e.g. drivers/net/r8169.c):
> > >
> > >               /* We need to force the visibility of tp->intr_mask
> > >                * for other CPUs, as we can lose an MSI interrupt
> > >                * and potentially wait for a retransmit timeout if we don't.
> > >                * The posted write to IntrMask is safe, as it will
> > >                * eventually make it to the chip and we won't lose anything
> > >                * until it does.
> > >                */
> > >               tp->intr_mask = 0xffff;
> > >               smp_wmb();
> > >               RTL_W16(IntrMask, tp->intr_event);
> > >
> > > Is this supposed to work given the SMP barriers semantics?
> > 
> > Well, if the smp_wmb() is supposed to make the assignment to
> > tp->intr_mask globally visible before any effects of the RTL_W16(),
> > then it's buggy.  But from the comments it appears that the smp_wmb()
> > might be intended to order the store to tp->intr_mask with respect to
> > following cacheable stores, rather than with respect to the RTL_W16(),
> > which would be OK.  I can't say without having a much closer look at
> > what that driver is actually doing.
> 
> I cc'ed the r8169.c maintainer.
> 
> But from the architectural support perspective, we don't need to support
> more than a lightweight barrier in this case.

Be afraid, very afraid when you find a non-SMP memory barrier in the
kernel.  A while ago I reviewed a number of uses throughout the kernel and
each one of them was somehow buggy - either entirely unnecessary or should
be replaced with an SMP memory barrier, or was simply misplaced.

  Ralf


* Re: SMP barriers semantics
  2010-03-12 12:31     ` Ralf Baechle
@ 2010-03-12 20:38       ` Jamie Lokier
  2010-03-17  2:25       ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 22+ messages in thread
From: Jamie Lokier @ 2010-03-12 20:38 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Catalin Marinas, Paul Mackerras, linux-arch, Russell King,
	Francois Romieu

Ralf Baechle wrote:
> On Wed, Mar 03, 2010 at 12:03:45PM +0000, Catalin Marinas wrote:
> > > >               /* We need to force the visibility of tp->intr_mask
> > > >                * for other CPUs, as we can lose an MSI interrupt
> > > >                * and potentially wait for a retransmit timeout if we don't.
> > > >                * The posted write to IntrMask is safe, as it will
> > > >                * eventually make it to the chip and we won't lose anything
> > > >                * until it does.
> > > >                */
> > > >               tp->intr_mask = 0xffff;
> > > >               smp_wmb();
> > > >               RTL_W16(IntrMask, tp->intr_event);
> > > >
> > > > Is this supposed to work given the SMP barriers semantics?
> > > 
> > > Well, if the smp_wmb() is supposed to make the assignment to
> > > tp->intr_mask globally visible before any effects of the RTL_W16(),
> > > then it's buggy.  But from the comments it appears that the smp_wmb()
> > > might be intended to order the store to tp->intr_mask with respect to
> > > following cacheable stores, rather than with respect to the RTL_W16(),
> > > which would be OK.  I can't say without having a much closer look at
> > > what that driver is actually doing.
> > 
> > I cc'ed the r8169.c maintainer.
> > 
> > But from the architectural support perspective, we don't need to support
> > more than a lightweight barrier in this case.

If the ordering relative to RTL_W16 doesn't matter, imho it would be
much clearer to move the RTL_W16() somewhere else, such as before the
comment.  The comment is quite misleading.

> Be afraid, very afraid when you find a non-SMP memory barrier in the
> kernel.  A while ago I reviewed a number of uses throughout the kernel and
> each one of them was somehow buggy - either entirely unnecessary or should
> be replaced with an SMP memory barrier, or was simply misplaced.

It's not hard to think of cases where a non-SMP barrier is necessary
outside of the kernel code API (dma_* etc.), but it would be quite
interesting if those cases never occur with real devices, or if they
can always be transformed to something else.

The RTL_W16 sample is quite interesting - even if it does not apply to
that particular driver, it's suggestive of a pattern where some device
may need to delay triggering an interrupt (on a different CPU) until a
data structure is written, or that some device may require ordered
MMIO reads and writes.  In the case of interrupts, code like that can
probably be transformed.

-- Jamie


* Re: SMP barriers semantics
  2010-03-12 12:31     ` Ralf Baechle
  2010-03-12 20:38       ` Jamie Lokier
@ 2010-03-17  2:25       ` Benjamin Herrenschmidt
  2010-03-17 10:31         ` Catalin Marinas
  2010-03-17 13:42         ` Jamie Lokier
  1 sibling, 2 replies; 22+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-17  2:25 UTC (permalink / raw)
  To: Ralf Baechle
  Cc: Catalin Marinas, Paul Mackerras, linux-arch, Russell King,
	Francois Romieu

On Fri, 2010-03-12 at 13:31 +0100, Ralf Baechle wrote:
> 
> Be afraid, very afraid when you find a non-SMP memory barrier in the
> kernel.  A while ago I reviewed a number of uses throughout the kernel and
> each one of them was somehow buggy - either entirely unnecessary or should
> be replaced with an SMP memory barrier, or was simply misplaced.

Yes, it's a problem that's plaguing us too.... Now that ARM is moving to
a memory model that is similar to powerpc, maybe we should revisit
things a little bit :-)

The way we sort out the issues currently on powerpc is very heavy handed
and consists of sticking full barriers in our MMIO accessors.

We would -like- to improve that but there's just too much drivers out
there that assume some kind of ordering between consistent memory
accesses and IOs and between IOs and locks.

The only "sane" way out here would thus be to define a set of MMIO
accessors with relaxed ordering semantics along with appropriate
barriers. However, when we tried to do that, we mostly failed to agree
on what the semantics actually should be since most architectures are
going to be subtly different in that regard.

The __raw_* variants are of no use since they have other effects such as
not performing endian swapping on BE architectures. Plus we don't have
the proper explicit barriers anyways.

Maybe we can agree on a set of relaxed accessors defined specifically
with those semantics (more relaxed implies use of raw_*) :

  - order is guaranteed between MMIOs
  - no order is guaranteed between MMIOs and memory (cachable) accesses
  - no order is guaranteed between MMIOs and spinlocks
  - a read access is not guaranteed to be performed until the read value
    is actually consumed by the processor

Along with barriers along the lines of (that's where we may want to
discuss a bit I believe)

  - io_to_mem_rbarrier() : orders MMIO read vs. subsequent memory read
  - mem_to_io_wbarrier() : order memory write vs. subsequent MMIO write
                           (includes spin_unlock ?)
  - io_to_mem_barrier()  : order any MMIO vs. subsequent memory accesses
  - mem_to_io_barrier()  : order memory accesses vs any subsequent MMIO
  - flush_io_read(val)   : takes value from MMIO read, enforces it's
                           completed before further instructions are
                           issued, acts as compiler&execution barrier.

What do you guys think ? I think io_to_mem_rbarrier() and
mem_to_io_wbarrier() cover most useful cases, except maybe the IO vs.
locks.

Maybe for the latter (locks) we can stick to making the arch responsible
for that, for example by having MMIO writes set a per-cpu flag
and having spin_unlock do a barrier when that is set, and take that out
of the list of non-guaranteed ordering semantics.

Cheers,
Ben.
 


* Re: SMP barriers semantics
  2010-03-17  2:25       ` Benjamin Herrenschmidt
@ 2010-03-17 10:31         ` Catalin Marinas
  2010-03-17 13:42         ` Jamie Lokier
  1 sibling, 0 replies; 22+ messages in thread
From: Catalin Marinas @ 2010-03-17 10:31 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Ralf Baechle, Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Wed, 2010-03-17 at 02:25 +0000, Benjamin Herrenschmidt wrote:
> On Fri, 2010-03-12 at 13:31 +0100, Ralf Baechle wrote:
> > Be afraid, very afraid when you find a non-SMP memory barrier in the
> > kernel.  A while ago I reviewed a number of uses throughout the kernel and
> > each one of them was somehow buggy - either entirely unnecessary or should
> > be replaced with an SMP memory barrier, or was simply misplaced.
> 
> Yes, it's a problem that's plaguing us too.... Now that ARM is moving to
> a memory model that is similar to powerpc, maybe we should revisit
> things a little bit :-)
> 
> The way we sort out the issues currently on powerpc is very heavy handed
> and consists of sticking full barriers in our MMIO accessors.
> 
> We would -like- to improve that but there are just too many drivers out
> there that assume some kind of ordering between consistent memory
> accesses and IOs and between IOs and locks.

I find the memory-barriers.txt document pretty much alright, apart from
needing some clarification on the SMP barriers with respect to MMIO
accesses.

> Maybe we can agree on a set of relaxed accessors defined specifically
> with those semantics (more relaxed implies use of raw_*) :
> 
>   - order is guaranteed between MMIOs
>   - no order is guaranteed between MMIOs and memory (cachable) accesses
>   - no order is guaranteed between MMIOs and spinlocks

I would add:

  - no order is guaranteed between MMIOs and interrupt
    enabling/disabling

>   - a read access is not guaranteed to be performed until the read value
>     is actually consumed by the processor

I don't think we have this issue on ARM; we have speculative memory
accesses but not lazy accesses. Other architectures may of course
differ.

> Along with barriers along the lines of (that's where we may want to
> discuss a bit I believe)
> 
>   - io_to_mem_rbarrier() : orders MMIO read vs. subsequent memory read

I currently assume rmb() to be enough for this (otherwise I don't see
where else the mandatory barriers would be useful).

>   - mem_to_io_wbarrier() : order memory write vs. subsequent MMIO write

wmb()

>                            (includes spin_unlock ?)

memory-barriers.txt suggests mmiowb() in such cases.

>   - io_to_mem_barrier()  : order any MMIO vs. subsequent memory accesses

mb()

>   - mem_to_io_barrier()  : order memory accesses vs any subsequent MMIO

mb()

>   - flush_io_read(val)   : takes value from MMIO read, enforces it's
>                            completed before further instructions are
>                            issued, acts as compiler&execution barrier.

Could we add an mmiorb()?

Ideally, we should avoid adding more barriers to Linux since many
drivers don't seem to use them anyway.

-- 
Catalin


* Re: SMP barriers semantics
  2010-03-17  2:25       ` Benjamin Herrenschmidt
  2010-03-17 10:31         ` Catalin Marinas
@ 2010-03-17 13:42         ` Jamie Lokier
  2010-03-22 12:02           ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2010-03-17 13:42 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Ralf Baechle, Catalin Marinas, Paul Mackerras, linux-arch,
	Russell King, Francois Romieu

Benjamin Herrenschmidt wrote:
> Maybe we can agree on a set of relaxed accessors defined specifically
> with those semantics (more relaxed implies use of raw_*) :
> 
>   - order is guaranteed between MMIOs
>   - no order is guaranteed between MMIOs and spinlocks

No order between MMIOs and spinlocks will be fun :-)

>   - a read access is not guaranteed to be performed until the read value
>     is actually consumed by the processor

How do you define 'consumed'?  It's not obvious: see
read_barrier_depends() on Alpha, and elsewhere speculative read
optimisations (including by the compiler on IA64, in theory).

> Along with barriers along the lines of (that's where we may want to
> discuss a bit I believe)
> 
>   - io_to_mem_rbarrier() : orders MMIO read vs. subsequent memory read
>   - mem_to_io_wbarrier() : order memory write vs. subsequent MMIO write
>                            (includes spin_unlock ?)
>   - io_to_mem_barrier()  : order any MMIO vs. subsequent memory accesses
>   - mem_to_io_barrier()  : order memory accesses vs any subsequent MMIO
>   - flush_io_read(val)   : takes value from MMIO read, enforces it's
>                            completed before further instructions are
>                            issued, acts as compiler&execution barrier.
> 
> What do you guys think ? I think io_to_mem_rbarrier() and
> mem_to_io_wbarrier() cover most useful cases, except maybe the IO vs.
> locks.

I'm thinking the current scheme with "heavy" barriers that serialise
everything is better - because it's simple to understand.  I think the
above set are likely to be accidentally misused, and even pass review
occasionally when they are wrong, even if it's just due to brain farts.

The "heavy" barrier names could change from rmb/mb/wmb to
io_rmb/io_mb/io_wmb or something.

Is there any real performance advantage expected from more
fine-grained MMIO barriers?

Another approach is a flags argument to a single macro:
barrier(BARRIER_MMIO_READ_BEFORE | BARRIER_MEM_READ_AFTER), etc.
Each arch would expand it to a barrier at least as strong as the one
requested.  C++0x atomics are going down this route, I think.

-- Jamie


* Re: SMP barriers semantics
  2010-03-17 13:42         ` Jamie Lokier
@ 2010-03-22 12:02           ` Nick Piggin
  2010-03-23  3:42             ` Nick Piggin
  2010-03-23 10:24             ` Catalin Marinas
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2010-03-22 12:02 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Benjamin Herrenschmidt, Ralf Baechle, Catalin Marinas,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Wed, Mar 17, 2010 at 01:42:43PM +0000, Jamie Lokier wrote:
> Benjamin Herrenschmidt wrote:
> > Maybe we can agree on a set of relaxed accessors defined specifically
> > with those semantics (more relaxed implies use of raw_*) :
> > 
> >   - order is guaranteed between MMIOs
> >   - no order is guaranteed between MMIOs and spinlocks
> 
> No order between MMIOs and spinlocks will be fun :-)

There isn't any order anyway, and things are pretty messed up in this area
already. We have mmiowb. Some architectures do a bit of mmio vs
lock synchronization. Eg. powerpc. But it doesn't do barriers on
some other locks like bit spinlocks or mutexes or rwsems or
semaphores blah blah.

When this came up I grepped a couple of drivers and found possible
problems straight away.

So IMO, we need to take all these out of lock primitives and just
increase awareness of it. Get rid of mmiowb. wmb() should be enough
to keep mmio stores ahead of the store that drops any lock (by definition).

Actually I think having an io_acquire_barrier() / io_release_barrier()
for the purpose of keeping ios inside locks is a good idea (paired
inside the actual lock/unlock functions). This basically gives them
a self-documenting property.


> >   - a read access is not guaranteed to be performed until the read value
> >     is actually consumed by the processor
> 
> How do you define 'consumed'?  It's not obvious: see
> read_barrier_depends() on Alpha, and elsewhere speculative read
> optimisations (including by the compiler on IA64, in theory).
> 
> > Along with barriers along the lines of (that's where we may want to
> > discuss a bit I believe)
> > 
> >   - io_to_mem_rbarrier() : orders MMIO read vs. subsequent memory read
> >   - mem_to_io_wbarrier() : order memory write vs. subsequent MMIO write
> >                            (includes spin_unlock ?)
> >   - io_to_mem_barrier()  : order any MMIO vs. subsequent memory accesses
> >   - mem_to_io_barrier()  : order memory accesses vs any subsequent MMIO
> >   - flush_io_read(val)   : takes value from MMIO read, enforces it's
> >                            completed before further instructions are
> >                            issued, acts as compiler&execution barrier.
> > 
> > What do you guys think ? I think io_to_mem_rbarrier() and
> > mem_to_io_wbarrier() cover most useful cases, except maybe the IO vs.
> > locks.
> 
> I'm thinking the current scheme with "heavy" barriers that serialise
> everything is better - because it's simple to understand.  I think the
> above set are likely to be accidentally misused, and even pass review
> occasionally when they are wrong, even if it's just due to brain farts.
> 
> The "heavy" barrier names could change from rmb/mb/wmb to
> io_rmb/io_mb/io_wmb or something.
> 
> Is there any real performance advantage expected from more
> fine-grained MMIO barriers?
> 
> Another approach is a flags argument to a single macro:
> barrier(BARRIER_MMIO_READ_BEFORE | BARRIER_MEM_READ_AFTER), etc.

This is just syntax really. Like you I also prefer sticking to simpler
barriers. I would like to see any new barrier introduced on a case by
case basis with before/after numbers and allow discussion on whether it
can be avoided.

So many cases people get the simple barriers wrong...


* Re: SMP barriers semantics
  2010-03-22 12:02           ` Nick Piggin
@ 2010-03-23  3:42             ` Nick Piggin
  2010-03-23 10:24             ` Catalin Marinas
  1 sibling, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2010-03-23  3:42 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Benjamin Herrenschmidt, Ralf Baechle, Catalin Marinas,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Mon, Mar 22, 2010 at 11:02:03PM +1100, Nick Piggin wrote:
> On Wed, Mar 17, 2010 at 01:42:43PM +0000, Jamie Lokier wrote:
> > Benjamin Herrenschmidt wrote:
> > > Maybe we can agree on a set of relaxed accessors defined specifically
> > > with those semantics (more relaxed implies use of raw_*) :
> > > 
> > >   - order is guaranteed between MMIOs
> > >   - no order is guaranteed between MMIOs and spinlocks
> > 
> > No order between MMIOs and spinlocks will be fun :-)
> 
> There isn't any order anyway, and things are pretty messed up in this area
> already. We have mmiowb. Some architectures do a bit of mmio vs
> lock synchronization. Eg. powerpc. But it doesn't do barriers on
> some other locks like bit spinlocks or mutexes or rwsems or
> semaphores blah blah.
> 
> When this came up I grepped a couple of drivers and found possible
> problems straight away.

I.e. with drivers using a mutex to synchronise mmios, which breaks on
powerpc (even though spinlocks would work due to their hack in there).

And on some other architectures the driver would be broken whether it
uses spinlocks or not because it is not using the right barriers.

I think we have basically everything we need right now, except to
have things clearly documented and consistent across architectures, and
to have mmio acquire/release paired with _any_ locking function
(bitops, atomics, spinlocks, mutexes, etc.).

If we enforce *strong* ordering with cacheable accesses and other IO
for regular IO accessors, then a driver would not have to do anything
special for ios or locks or anything. This can be very expensive on
some archs, but IMO it is the best default because any driver that would
work on x86 using these will work on any other arch.

It is much easier for a powerpc or altix developer to take a working
but underperforming driver, identify some of the critical ios and relax
them than it is for them to try to fix a strangely broken driver (where
the driver developer and 99% of driver testers don't have their
hardware).

For relaxed accessors, I think acquire/release (versus cacheable), and
then the traditional mb, rmb, wmb should be a pretty good start.

Anyone disagree? I would volunteer to post some patches if not...

* Re: SMP barriers semantics
  2010-03-22 12:02           ` Nick Piggin
  2010-03-23  3:42             ` Nick Piggin
@ 2010-03-23 10:24             ` Catalin Marinas
  2010-04-06 14:20               ` Nick Piggin
  1 sibling, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2010-03-23 10:24 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jamie Lokier, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Mon, 2010-03-22 at 12:02 +0000, Nick Piggin wrote:
> On Wed, Mar 17, 2010 at 01:42:43PM +0000, Jamie Lokier wrote:
> > Benjamin Herrenschmidt wrote:
> > > Maybe we can agree on a set of relaxed accessors defined specifically
> > > with those semantics (more relaxed implies use of raw_*) :
> > >
> > >   - order is guaranteed between MMIOs
> > >   - no order is guaranteed between MMIOs and spinlocks
> >
> > No order between MMIOs and spinlocks will be fun :-)
> 
> There isn't anyway, and things are pretty messed up in this area
> already. We have mmiowb. Some architectures do a bit of mmio vs
> lock synchronization. Eg. powerpc. But it doesn't do barriers on
> some other locks like bit spinlocks or mutexes or rwsems or
> semaphores blah blah.
> 
> When this came up I grepped a couple of drivers and found possible
> problems straight away.
> 
> So IMO, we need to take all these out of lock primitives and just
> increase awareness of it. Get rid of mmiowb. wmb() should be enough
> to keep mmio stores inside the store to drop any lock (by definition).

I think we have different scenarios for wmb and mmiowb (my
understanding). One is when the driver writes to a coherent DMA buffer
(usually uncached) and it then needs to drain the write buffer before
informing the device to start the transfer. That's where wmb() would be
used (with normal uncached memory).

The mmiowb() may need to go beyond the CPU write-buffer level into the
PCI bus etc. but only for relative ordering of the I/O accesses. The
memory-barriers.txt suggests as much. My understanding is that
mmiowb() drains any mmio buffers while wmb() drains normal memory
buffers.

> Actually I think having an io_acquire_barrier() / io_release_barrier()
> for the purpose of keeping ios inside locks is a good idea (paired
> inside the actual lock/unlock functions). This basically gives them
> a self-documenting property.

Is there any architecture where mmio accesses are speculated? Do we need
an io_acquire_barrier()? At least on ARM, device memory access is not
restartable by definition, which means that it cannot in general be
speculated. If that's true for other architectures as well, we wouldn't
need anything more than mmiowb() (but I may be wrong here).

Thanks.

-- 
Catalin

* Re: SMP barriers semantics
  2010-03-23 10:24             ` Catalin Marinas
@ 2010-04-06 14:20               ` Nick Piggin
  2010-04-06 15:43                 ` Jamie Lokier
  2010-04-23 16:23                 ` Catalin Marinas
  0 siblings, 2 replies; 22+ messages in thread
From: Nick Piggin @ 2010-04-06 14:20 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Jamie Lokier, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Tue, Mar 23, 2010 at 10:24:07AM +0000, Catalin Marinas wrote:
> On Mon, 2010-03-22 at 12:02 +0000, Nick Piggin wrote:
> > On Wed, Mar 17, 2010 at 01:42:43PM +0000, Jamie Lokier wrote:
> > > Benjamin Herrenschmidt wrote:
> > > > Maybe we can agree on a set of relaxed accessors defined specifically
> > > > with those semantics (more relaxed implies use of raw_*) :
> > > >
> > > >   - order is guaranteed between MMIOs
> > > >   - no order is guaranteed between MMIOs and spinlocks
> > >
> > > No order between MMIOs and spinlocks will be fun :-)
> > 
> > There isn't anyway, and things are pretty messed up in this area
> > already. We have mmiowb. Some architectures do a bit of mmio vs
> > lock synchronization. Eg. powerpc. But it doesn't do barriers on
> > some other locks like bit spinlocks or mutexes or rwsems or
> > semaphores blah blah.
> > 
> > When this came up I grepped a couple of drivers and found possible
> > problems straight away.
> > 
> > So IMO, we need to take all these out of lock primitives and just
> > increase awareness of it. Get rid of mmiowb. wmb() should be enough
> > to keep mmio stores inside the store to drop any lock (by definition).
> 
> I think we have different scenarios for wmb and mmiowb (my
> understanding). One is when the driver writes to a coherent DMA buffer
> (usually uncached) and it then needs to drain the write buffer before
> informing the device to start the transfer. That's where wmb() would be
> used (with normal uncached memory).
> 
> The mmiowb() may need to go beyond the CPU write-buffer level into the
> PCI bus etc. but only for relative ordering of the I/O accesses. The
> memory-barriers.txt suggests as much. My understanding is that
> mmiowb() drains any mmio buffers while wmb() drains normal memory
> buffers.

No barriers are defined to drain anything, only to order. wmb() is
defined to order all memory stores, so all previous stores, cached and
IO, are seen before all subsequent stores. And considering that we are
talking about IO, "seen" obviously means seen by the device as well as
by other CPUs.

mmiowb was introduced because this was too heavy on some platforms
(specifically ia64-sn2). And basically they weakened wmb() semantics
and introduced this heavier operation to solve the problem of ordering
IO from different CPUs.

The problem with that is that you can't just redefine an existing,
age-old API to be weaker. And also because the semantics of wmb() get
much more complex (look at the mmiowb calls added by non-SGI people:
most are wrong). And also, code that is tested and works on the most
common platforms does not work on others.

What is needed is to make the default accessors strongly ordered and
so driver writers can be really dumb about it, and IO / spinlock etc
synchronization "just works".

So as I said, it's much easier for the smart powerpc programmer to spot
the suboptimal sequence of a few mmio accesses and relax them than it is
for the dumb device driver writer to debug strange reports that aren't
reproducible on their x86.

(I'm not calling driver writers dumb, but the easier the driver APIs
are to use, the better.)
 

> > Actually I think having an io_acquire_barrier() / io_release_barrier()
> > for the purpose of keeping ios inside locks is a good idea (paired
> > inside the actual lock/unlock functions). This basically gives them
> > a self-documenting property.
> 
> Is there any architecture where mmio accesses are speculated? Do we need
> an io_acquire_barrier()? At least on ARM, device memory access is not
> restartable by definition, which means that it cannot in general be
> speculated. If that's true for other architectures as well, we wouldn't
> need anything more than mmiowb() (but I may be wrong here).

I would like acquire because it looks nice -- it may help with reading
the code, and we might be able to hook debug checks into them.

But hmm, I don't know if we even need acquire/release IO barriers at
all. Might be better to just fix up wmb(), get rid of mmiowb(),
strengthen IO accessors, and then just add special case barriers as
the need arises.

* Re: SMP barriers semantics
  2010-04-06 14:20               ` Nick Piggin
@ 2010-04-06 15:43                 ` Jamie Lokier
  2010-04-06 16:04                   ` Nick Piggin
  2010-04-23 16:23                 ` Catalin Marinas
  1 sibling, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2010-04-06 15:43 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Catalin Marinas, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

Nick Piggin wrote:
> On Tue, Mar 23, 2010 at 10:24:07AM +0000, Catalin Marinas wrote:
> 
> But hmm, I don't know if we even need acquire/release IO barriers at
> all. Might be better to just fix up wmb(), get rid of mmiowb(),
> strengthen IO accessors, and then just add special case barriers as
> the need arises.

I must admit I don't understand what wmb() means at this point,
generically from the point of view of arch-independent drivers!  It's
not an inter-CPU ordering (smp_wmb() is sufficient), and it doesn't
order all memory and I/O writes (otherwise why mmiowb?).  I suspect a
few people have been unsure, resulting in a bit of confusion about
what goes into different arch implementations of wmb().

For strengthening I/O accessors, do you mean the equivalent of putting
"dmb;dsb" before _and_ after the I/O write inside every call to
writel()?  (That's using ARM as an example: "dmb" means barrier, and
"dsb" means flush write buffers I think, and some ARMs need other,
rather heavier instructions such as a coprocessor instruction or even
an instruction to the L2 cache.)

Because I'm not sure if that's as light as we'd like it to be on
slower CPUs, and __raw_writel or something will get used instead by
some driver writer... leading back to needing to be very clear about
the meaning of wmb/mmiowb.

-- Jamie

* Re: SMP barriers semantics
  2010-04-06 15:43                 ` Jamie Lokier
@ 2010-04-06 16:04                   ` Nick Piggin
  0 siblings, 0 replies; 22+ messages in thread
From: Nick Piggin @ 2010-04-06 16:04 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Catalin Marinas, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Tue, Apr 06, 2010 at 04:43:21PM +0100, Jamie Lokier wrote:
> Nick Piggin wrote:
> > On Tue, Mar 23, 2010 at 10:24:07AM +0000, Catalin Marinas wrote:
> > 
> > But hmm, I don't know if we even need acquire/release IO barriers at
> > all. Might be better to just fix up wmb(), get rid of mmiowb(),
> > strengthen IO accessors, and then just add special case barriers as
> > the need arises.
> 
> I must admit I don't understand what wmb() means at this point,
> generically from the point of view of arch-independent drivers!  It's
> not an inter-CPU ordering (smp_wmb() is sufficient), and it doesn't
> order all memory and I/O writes (otherwise why mmiowb?).  I suspect a
> few people have been unsure, resulting in a bit of confusion about
> what goes into different arch implementations of wmb().

No, it has forever been defined to order *everything* (and this
includes ordering as-seen from IO devices, not as-sent from CPU,
otherwise it is absolutely useless on its own).

The problem is that it was weakened and made more complex by one
particular arch, and something else added to make up for it. This simply
doesn't work because it is impossible to audit all drivers (or get
driver writers to understand new semantics).

So, wmb() really should mean what it has always meant. sn2 should be
fixed up. Relaxed schemes should be added as required.

 
> For strengthening I/O accessors, do you mean the equivalent of putting
> "dmb;dsb" before _and_ after the I/O write inside every call to
> writel()?  (That's using ARM as an example: "dmb" means barrier, and
> "dsb" means flush write buffers I think, and some ARMs need other,
> rather heavier instructions such as a coprocessor instruction or even
> an instruction to the L2 cache.)
> 
> Because I'm not sure if that's as light as we'd like it to be on
> slower CPUs, and __raw_writel or something will get used instead by
> some driver writer...

writel_relaxed, I guess. It will get used, but it should be in a small
portion of driver code, that is carefully checked.


> leading back to needing to be very clear about
> the meaning of wmb/mmiowb.

* Re: SMP barriers semantics
  2010-04-06 14:20               ` Nick Piggin
  2010-04-06 15:43                 ` Jamie Lokier
@ 2010-04-23 16:23                 ` Catalin Marinas
  2010-04-23 16:56                   ` Jamie Lokier
  1 sibling, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2010-04-23 16:23 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Jamie Lokier, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Tue, 2010-04-06 at 15:20 +0100, Nick Piggin wrote:
> On Tue, Mar 23, 2010 at 10:24:07AM +0000, Catalin Marinas wrote:
> > On Mon, 2010-03-22 at 12:02 +0000, Nick Piggin wrote:
> > > So IMO, we need to take all these out of lock primitives and just
> > > increase awareness of it. Get rid of mmiowb. wmb() should be enough
> > > to keep mmio stores inside the store to drop any lock (by definition).
> >
> > I think we have different scenarios for wmb and mmiowb (my
> > understanding). One is when the driver writes to a coherent DMA buffer
> > (usually uncached) and it then needs to drain the write buffer before
> > informing the device to start the transfer. That's where wmb() would be
> > used (with normal uncached memory).
> >
> > The mmiowb() may need to go beyond the CPU write-buffer level into the
> > PCI bus etc. but only for relative ordering of the I/O accesses. The
> > memory-barriers.txt suggests as much. My understanding is that
> > mmiowb() drains any mmio buffers while wmb() drains normal memory
> > buffers.
> 
> No barriers are defined to drain anything, only order. wmb() is defined
> to order all memory stores, so all previous stores cached and IO are
> seen before all subsequent stores. And considering that we are talking
> about IO, "seen" obviously means seen by the device as well as other
> CPUs.

Indeed, the barriers aren't defined to drain anything, though they may
do it on specific implementations (or when "seen" actually requires
draining).

The Documentation/DMA-API.txt file mentions that the CPU write buffer
may need to be flushed after writing coherent memory but the kernel
doesn't define any primitive for doing this. Hence my assumption that
this is the job of wmb().

> What is needed is to make the default accessors strongly ordered and
> so driver writers can be really dumb about it, and IO / spinlock etc
> synchronization "just works".

On ARM, the I/O accessors are ordered with respect to device memory
accesses but not with respect to normal non-cacheable memory
(dma_alloc_coherent). If we want to make the writel etc. accessors
ordered with respect to the normal non-cacheable memory, that would be
really expensive on several ARM platforms. Apart from the CPU barrier (a
full one - DSB - to drain the write buffer), some platforms require
draining the write buffer of the L2 cache as well (by writing to
registers of the L2 cache controller).

So I'm more in favour of having stronger semantics for wmb() and leaving
the I/O accessors semantics to only ensure device memory ordering.

-- 
Catalin

* Re: SMP barriers semantics
  2010-04-23 16:23                 ` Catalin Marinas
@ 2010-04-23 16:56                   ` Jamie Lokier
  2010-04-23 17:25                     ` Catalin Marinas
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2010-04-23 16:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Nick Piggin, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

Catalin Marinas wrote:
> On ARM, the I/O accessors are ordered with respect to device memory
> accesses but not with respect to normal non-cacheable memory
> (dma_alloc_coherent). If we want to make the writel etc. accessors
> ordered with respect to the normal non-cacheable memory, that would be
> really expensive on several ARM platforms. Apart from the CPU barrier (a
> full one - DSB - to drain the write buffer), some platforms require
> draining the write buffer of the L2 cache as well (by writing to
> registers of the L2 cache controller).

Is it useful to call it non-cacheable memory if the only way to make
it device-visible is an instruction to the L2 cache controller asking
it to flush? :-)

On at least one ARM-compatible core I know of, you can disable write
buffering for a region, independent of caching.  Is it possible, in
general, to make dma_alloc_coherent() on ARM return unbuffered,
uncached memory?

> So I'm more in favour of having stronger semantics for wmb() and leaving
> the I/O accessors semantics to only ensure device memory ordering.

From the above, I'm thinking the semantics for wmb() should include:

   - Flushes any buffered writes to memory which is mapped uncached,
     but does not have to flush buffered writes to cached memory.

So if the arch always maps uncached memory as unbuffered too, wmb()
won't have to flush the CPU write buffer, which can be expensive, as
you say.

You probably already planned on this, but it's good to make
it explicit in any docs - if that's an agreeable semantic.

-- Jamie

* Re: SMP barriers semantics
  2010-04-23 16:56                   ` Jamie Lokier
@ 2010-04-23 17:25                     ` Catalin Marinas
  2010-04-24  1:45                       ` Jamie Lokier
  0 siblings, 1 reply; 22+ messages in thread
From: Catalin Marinas @ 2010-04-23 17:25 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Fri, 2010-04-23 at 17:56 +0100, Jamie Lokier wrote:
> Catalin Marinas wrote:
> > On ARM, the I/O accessors are ordered with respect to device memory
> > accesses but not with respect to normal non-cacheable memory
> > (dma_alloc_coherent). If we want to make the writel etc. accessors
> > ordered with respect to the normal non-cacheable memory, that would be
> > really expensive on several ARM platforms. Apart from the CPU barrier (a
> > full one - DSB - to drain the write buffer), some platforms require
> > draining the write buffer of the L2 cache as well (by writing to
> > registers of the L2 cache controller).
> 
> It is useful to call it non-cacheable memory if the only way to make
> it device-visible is an instruction to the L2 cache controller asking
> it to flush? :-)

It's not the cache itself but only its write buffer. Ideally, the DSB
should get through to the L2 cache as well, but there are platforms
that don't do this.

> On at least one ARM-compatible core I know of, you can disable write
> buffering for a region, independent of caching.  Is it possible in
> general, to make dma_alloc_coherent on ARM return unbuffered uncached memory?

On recent ARM cores, it's only Strongly Ordered memory that can have
write buffering disabled. But using such a mapping for
dma_alloc_coherent() introduces other problems, like cache attribute
aliases for a physical location (since the kernel already maps the RAM
as Normal memory). Such aliases are not allowed, hence we moved to
Normal Non-cacheable for dma_alloc_coherent().

> > So I'm more in favour of having stronger semantics for wmb() and leaving
> > the I/O accessors semantics to only ensure device memory ordering.
> 
> From the above, I'm thinking the semantics for wmb() should include:
> 
>    - Flushes any buffered writes to memory which is mapped uncached,
>      but does not have to flush buffered writes to cached memory.

Yes.

> So if the arch always mapped uncached memory also unbuffered, wmb()
> won't have to flush the CPU write buffer, which can be expensive as you say.
> 
> You probably already planned on this, but it's good to make
> it explicit in any docs - if that's an agreeable semantic.

The patches for the ARM platform to make wmb() drain the write buffers
(DSB + L2 cache sync) are already in mainline. That was the only way to
allow drivers using dma_alloc_coherent() to work correctly.

-- 
Catalin

* Re: SMP barriers semantics
  2010-04-23 17:25                     ` Catalin Marinas
@ 2010-04-24  1:45                       ` Jamie Lokier
  2010-04-26  9:21                         ` Catalin Marinas
  0 siblings, 1 reply; 22+ messages in thread
From: Jamie Lokier @ 2010-04-24  1:45 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Nick Piggin, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

Catalin Marinas wrote:
> On recent ARM cores, it's only the Strongly Ordered memory that can have
> the write buffering disabled. But using such mapping for
> dma_alloc_coherent() introduces other problems like cache attribute
> aliases for a physical location (since the kernel maps the RAM as Normal
> memory already). Such aliases are not allowed hence we moved to Normal
> Non-cacheable for dma_alloc_coherent().

I'm surprised aliases between Normal-Cached and Normal-Uncached are ok,
while aliases between Normal-Cached and SO-Uncached are not ok.

Is it theoretically ok by the ARM specs, or just optimistic programming? :-)

If optimism, it's easy to imagine an implementation where
(unrequested) speculative reads populate the cached mapping, and then
accesses to the Normal-Uncached alias of it get a cache hit and use
that, wrongly.  Or, conversely, if it definitely does not treat that
as a cache hit, it's hard to imagine why SO-Uncached would be different.

-- Jamie

* Re: SMP barriers semantics
  2010-04-24  1:45                       ` Jamie Lokier
@ 2010-04-26  9:21                         ` Catalin Marinas
  0 siblings, 0 replies; 22+ messages in thread
From: Catalin Marinas @ 2010-04-26  9:21 UTC (permalink / raw)
  To: Jamie Lokier
  Cc: Nick Piggin, Benjamin Herrenschmidt, Ralf Baechle,
	Paul Mackerras, linux-arch, Russell King, Francois Romieu

On Sat, 2010-04-24 at 02:45 +0100, Jamie Lokier wrote:
> Catalin Marinas wrote:
> > On recent ARM cores, it's only the Strongly Ordered memory that can have
> > the write buffering disabled. But using such mapping for
> > dma_alloc_coherent() introduces other problems like cache attribute
> > aliases for a physical location (since the kernel maps the RAM as Normal
> > memory already). Such aliases are not allowed hence we moved to Normal
> > Non-cacheable for dma_alloc_coherent().
> 
> I'm surprised aliases between Normal-Cached and Normal-Uncached are ok,
> while aliases between Normal-Cached and SO-Uncached are not ok.

In theory, all of the aliases are banned, but the Normal cacheable vs
non-cacheable alias has been allowed in practice and current ARM
implementations can cope with it (such "palliative" measures were
introduced in 2006). Other aliases aren't guaranteed to work correctly.

Normal Non-cacheable memory uses the write buffer (at both the CPU and
L2 cache level) while the Strongly Ordered doesn't. Normal memory
accesses on the same CPU check the write buffer as well since a CPU is
coherent with itself (on the D side).

-- 
Catalin

Thread overview: 22+ messages
2010-03-02 10:52 SMP barriers semantics Catalin Marinas
2010-03-03  0:55 ` Paul Mackerras
2010-03-03 12:03   ` Catalin Marinas
2010-03-12 12:31     ` Ralf Baechle
2010-03-12 20:38       ` Jamie Lokier
2010-03-17  2:25       ` Benjamin Herrenschmidt
2010-03-17 10:31         ` Catalin Marinas
2010-03-17 13:42         ` Jamie Lokier
2010-03-22 12:02           ` Nick Piggin
2010-03-23  3:42             ` Nick Piggin
2010-03-23 10:24             ` Catalin Marinas
2010-04-06 14:20               ` Nick Piggin
2010-04-06 15:43                 ` Jamie Lokier
2010-04-06 16:04                   ` Nick Piggin
2010-04-23 16:23                 ` Catalin Marinas
2010-04-23 16:56                   ` Jamie Lokier
2010-04-23 17:25                     ` Catalin Marinas
2010-04-24  1:45                       ` Jamie Lokier
2010-04-26  9:21                         ` Catalin Marinas
2010-03-04  2:23   ` David Dillow
2010-03-04  9:33     ` Russell King
2010-03-04  9:48       ` Catalin Marinas
