[RFC] docs/memory-barriers.txt: Rewrite "KERNEL I/O BARRIER EFFECTS" section

Message ID 20190211172948.3322-1-will.deacon@arm.com
State Superseded

Commit Message

Will Deacon Feb. 11, 2019, 5:29 p.m. UTC
The "KERNEL I/O BARRIER EFFECTS" section of memory-barriers.txt is vague,
x86-centric, out-of-date, incomplete and demonstrably incorrect in places.
This is largely because I/O ordering is a horrible can of worms, but also
because the document has stagnated as our understanding has evolved.

Attempt to address some of that, by rewriting the section based on
recent(-ish) discussions with Arnd, BenH and others. Maybe one day we'll
find a way to formalise this stuff, but for now let's at least try to
make the English easier to understand.

Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
Cc: Daniel Lustig <dlustig@nvidia.com>
Cc: David Howells <dhowells@redhat.com>
Cc: Alan Stern <stern@rowland.harvard.edu>
cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 Documentation/memory-barriers.txt | 115 ++++++++++++++++++++------------------
 1 file changed, 62 insertions(+), 53 deletions(-)

Comments

Paul E. McKenney Feb. 11, 2019, 8:22 p.m. UTC | #1
On Mon, Feb 11, 2019 at 05:29:48PM +0000, Will Deacon wrote:
> The "KERNEL I/O BARRIER EFFECTS" section of memory-barriers.txt is vague,
> x86-centric, out-of-date, incomplete and demonstrably incorrect in places.
> This is largely because I/O ordering is a horrible can of worms, but also
> because the document has stagnated as our understanding has evolved.
> 
> Attempt to address some of that, by rewriting the section based on
> recent(-ish) discussions with Arnd, BenH and others. Maybe one day we'll
> find a way to formalise this stuff, but for now let's at least try to
> make the English easier to understand.
> 
> Cc: "Paul E. McKenney" <paulmck@linux.ibm.com>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Arnd Bergmann <arnd@arndb.de>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
> Cc: Daniel Lustig <dlustig@nvidia.com>
> Cc: David Howells <dhowells@redhat.com>
> Cc: Alan Stern <stern@rowland.harvard.edu>
> cc: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Will Deacon <will.deacon@arm.com>

Hello, Will,

The intent is to replace commit 3f305018dcf3 ("docs/memory-barriers.txt:
Enforce heavy ordering for port I/O accesses"), correct?  Either way is
fine, just guessing based on the conflicts when applying this one.  ;-)

							Thanx, Paul

> ---
>  Documentation/memory-barriers.txt | 115 ++++++++++++++++++++------------------
>  1 file changed, 62 insertions(+), 53 deletions(-)
> 
> diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
> index 1c22b21ae922..d08b49b2c011 100644
> --- a/Documentation/memory-barriers.txt
> +++ b/Documentation/memory-barriers.txt
> @@ -2599,72 +2599,81 @@ likely, then interrupt-disabling locks should be used to guarantee ordering.
>  KERNEL I/O BARRIER EFFECTS
>  ==========================
>  
> -When accessing I/O memory, drivers should use the appropriate accessor
> -functions:
> -
> - (*) inX(), outX():
> -
> -     These are intended to talk to I/O space rather than memory space, but
> -     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
> -     do indeed have special I/O space access cycles and instructions, but many
> -     CPUs don't have such a concept.
> -
> -     The PCI bus, amongst others, defines an I/O space concept which - on such
> -     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
> -     space.  However, it may also be mapped as a virtual I/O space in the CPU's
> -     memory map, particularly on those CPUs that don't support alternate I/O
> -     spaces.
> -
> -     Accesses to this space may be fully synchronous (as on i386), but
> -     intermediary bridges (such as the PCI host bridge) may not fully honour
> -     that.
> -
> -     They are guaranteed to be fully ordered with respect to each other.
> -
> -     They are not guaranteed to be fully ordered with respect to other types of
> -     memory and I/O operation.
> +Interfacing with peripherals via I/O accesses is deeply architecture and device
> +specific. Therefore, drivers which are inherently non-portable may rely on
> +specific behaviours of their target systems in order to achieve synchronization
> +in the most lightweight manner possible. For drivers intending to be portable
> +between multiple architectures and bus implementations, the kernel offers a
> +series of accessor functions that provide various degrees of ordering
> +guarantees:
>  
>   (*) readX(), writeX():
>  
> -     Whether these are guaranteed to be fully ordered and uncombined with
> -     respect to each other on the issuing CPU depends on the characteristics
> -     defined for the memory window through which they're accessing.  On later
> -     i386 architecture machines, for example, this is controlled by way of the
> -     MTRR registers.
> +     The readX() and writeX() MMIO accessors take a pointer to the peripheral
> +     being accessed as an __iomem * parameter. For pointers mapped with the
> +     default I/O attributes (e.g. those returned by ioremap()), then the
> +     ordering guarantees are as follows:
> +
> +     1. All readX() and writeX() accesses to the same peripheral are ordered
> +        with respect to each other. For example, this ensures that MMIO register
> +	writes by the CPU to a particular device will arrive in program order.
> +
> +     2. A writeX() by the CPU to the peripheral will first wait for the
> +        completion of all prior CPU writes to memory. For example, this ensures
> +        that writes by the CPU to an outbound DMA buffer allocated by
> +        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
> +        to its MMIO control register to trigger the transfer.
> +
> +     3. A readX() by the CPU from the peripheral will complete before any
> +	subsequent CPU reads from memory can begin. For example, this ensures
> +	that reads by the CPU from an incoming DMA buffer allocated by
> +	dma_alloc_coherent() will not see stale data after reading from the DMA
> +	engine's MMIO status register to establish that the DMA transfer has
> +	completed.
> +
> +     4. A readX() by the CPU from the peripheral will complete before any
> +	subsequent delay() loop can begin execution. For example, this ensures
> +	that two MMIO register writes by the CPU to a peripheral will arrive at
> +	least 1us apart if the first write is immediately read back with readX()
> +	and udelay(1) is called prior to the second writeX().
> +
> +     __iomem pointers obtained with non-default attributes (e.g. those returned
> +     by ioremap_wc()) are unlikely to provide many of these guarantees. If
> +     ordering is required for such mappings, then the mandatory barriers should
> +     be used in conjunction with the _relaxed() accessors defined below.
> +
> + (*) readX_relaxed(), writeX_relaxed():
>  
> -     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
> -     provided they're not accessing a prefetchable device.
> -
> -     However, intermediary hardware (such as a PCI bridge) may indulge in
> -     deferral if it so wishes; to flush a store, a load from the same location
> -     is preferred[*], but a load from the same device or from configuration
> -     space should suffice for PCI.
> -
> -     [*] NOTE! attempting to load from the same location as was written to may
> -	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
> -	 example.
> -
> -     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
> -     force stores to be ordered.
> +     These are similar to readX() and writeX(), but provide weaker memory
> +     ordering guarantees. Specifically, they do not guarantee ordering with
> +     respect to normal memory accesses or delay() loops (i.e. bullets 2-4 above)
> +     but they are still guaranteed to be ordered with respect to other accesses
> +     to the same peripheral when operating on __iomem pointers mapped with the
> +     default I/O attributes.
>  
> -     Please refer to the PCI specification for more information on interactions
> -     between PCI transactions.
> + (*) inX(), outX():
>  
> - (*) readX_relaxed(), writeX_relaxed()
> +     The inX() and outX() accessors are intended to access legacy port-mapped
> +     I/O peripherals, which may require special instructions on some
> +     architectures (notably x86). The port number of the peripheral being
> +     accessed is passed as an argument.
>  
> -     These are similar to readX() and writeX(), but provide weaker memory
> -     ordering guarantees.  Specifically, they do not guarantee ordering with
> -     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
> -     ordering with respect to LOCK or UNLOCK operations.  If the latter is
> -     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
> -     the same peripheral are guaranteed to be ordered with respect to each
> -     other.
> +     Since many CPU architectures ultimately access these peripherals via an
> +     internal virtual memory mapping, the portable ordering guarantees provided
> +     by inX() and outX() are the same as those provided by readX() and writeX()
> +     respectively when accessing a mapping with the default I/O attributes.
>  
>   (*) ioreadX(), iowriteX()
>  
>       These will perform appropriately for the type of access they're actually
>       doing, be it inX()/outX() or readX()/writeX().
>  
> +All of these accessors assume that the underlying peripheral is little-endian,
> +and will therefore perform byte-swapping operations on big-endian architectures.
> +
> +Composing I/O ordering barriers with SMP ordering barriers and LOCK/UNLOCK
> +operations is a dangerous sport which may require the use of mmiowb(). See the
> +subsection "Acquires vs I/O accesses" for more information.
>  
>  ========================================
>  ASSUMED MINIMUM EXECUTION ORDERING MODEL
> -- 
> 2.11.0
>
Linus Torvalds Feb. 11, 2019, 10:34 p.m. UTC | #2
On Mon, Feb 11, 2019 at 9:30 AM Will Deacon <will.deacon@arm.com> wrote:
>
> +
> +     1. All readX() and writeX() accesses to the same peripheral are ordered
> +        with respect to each other. For example, this ensures that MMIO register
> +       writes by the CPU to a particular device will arrive in program order.

Hmm. I'd like more people look at strengthening this one wrt across
CPUs and locking.

Right now we document mmiowb(), but that "documentation" is really
just a fairy tale. Very *very* few drivers actually do mmiowb() on
their own.

IOW, we should seriously just consider making the rule be that locking
will order mmio too. Because that's practically the rule anyway.

Powerpc already does it. IO within a locked region will serialize with the lock.

                Linus
Benjamin Herrenschmidt Feb. 12, 2019, 4:01 a.m. UTC | #3
On Mon, 2019-02-11 at 14:34 -0800, Linus Torvalds wrote:
> On Mon, Feb 11, 2019 at 9:30 AM Will Deacon <will.deacon@arm.com> wrote:
> > +
> > +     1. All readX() and writeX() accesses to the same peripheral are ordered
> > +        with respect to each other. For example, this ensures that MMIO register
> > +       writes by the CPU to a particular device will arrive in program order.
> 
> Hmm. I'd like more people look at strengthening this one wrt across
> CPUs and locking.
> 
> Right now we document mmiowb(), but that "documentation" is really
> just a fairy tale. Very *very* few drivers actually do mmiowb() on
> their own.
> 
> IOW, we should seriously just consider making the rule be that locking
> will order mmio too. Because that's practically the rule anyway.
> 
> Powerpc already does it. IO within a locked region will serialize with the lock.

Yup. It's a bit ugly but I felt back then that getting drivers to use
mmiowb() properly was going to be a losing battle.

Cheers,
Ben.
Arnd Bergmann Feb. 12, 2019, 1:03 p.m. UTC | #4
On Mon, Feb 11, 2019 at 6:29 PM Will Deacon <will.deacon@arm.com> wrote:

> +     __iomem pointers obtained with non-default attributes (e.g. those returned
> +     by ioremap_wc()) are unlikely to provide many of these guarantees. If
> +     ordering is required for such mappings, then the mandatory barriers should
> +     be used in conjunction with the _relaxed() accessors defined below.

I wonder if we are even able to guarantee consistent behavior across
architectures in the last case here (wc mapping with relaxed accessors and
barriers).

Fortunately, there are only five implementations that actually differ from
ioremap_nocache(): arm32, arm64, ppc32, ppc64 and x86 (both 32/64), so
that is something we can probably figure out between the people on Cc.

The problem with recommending *_relaxed() + barrier() is that it ends
up being more expensive than the non-relaxed accessors whenever
_relaxed() implies the barrier already (true on most architectures other
than arm).

ioremap_wc() in turn is used almost exclusively to map RAM behind
a bus, (typically for frame buffers) and we may be better off not
assuming any particular MMIO barrier semantics for it at all, but possibly
audit the few uses that are not frame buffers.

> +     Since many CPU architectures ultimately access these peripherals via an
> +     internal virtual memory mapping, the portable ordering guarantees provided
> +     by inX() and outX() are the same as those provided by readX() and writeX()
> +     respectively when accessing a mapping with the default I/O attributes.

This is notably weaker than the PCI mandated non-posted write semantics.
As I said earlier, not all architectures or PCI host implementations can provide
non-posted writes though, but maybe we can document that fact here, e.g.

| Device drivers may expect outX() to be a non-posted write, i.e. waiting for
| a completion response from the I/O device, which may not be possible
| on particular hardware.

>   (*) ioreadX(), iowriteX()
>
>       These will perform appropriately for the type of access they're actually
>       doing, be it inX()/outX() or readX()/writeX().

This probably needs clarification as well then: On architectures that
have a stronger barrier after outX() than writeX() but that use memory
mapped access for both, the statement is currently not true. We could
either strengthen the requirement by requiring CONFIG_GENERIC_IOMAP
on such architectures, or we could document the current behavior
as intentional and explicitly not allow iowriteX() on I/O ports to be posted.

> +All of these accessors assume that the underlying peripheral is little-endian,
> +and will therefore perform byte-swapping operations on big-endian architectures.

This sounds like a useful addition and the only sane way to do it IMHO, but
I think at least traditionally we've had architectures that do not work like
this but that make readX()/writeX() do native big-endian loads and stores, with
a hardware byteswap on the PCI bus.

      Arnd
Will Deacon Feb. 12, 2019, 6:43 p.m. UTC | #5
On Mon, Feb 11, 2019 at 12:22:18PM -0800, Paul E. McKenney wrote:
> On Mon, Feb 11, 2019 at 05:29:48PM +0000, Will Deacon wrote:
> > The "KERNEL I/O BARRIER EFFECTS" section of memory-barriers.txt is vague,
> > x86-centric, out-of-date, incomplete and demonstrably incorrect in places.
> > This is largely because I/O ordering is a horrible can of worms, but also
> > because the document has stagnated as our understanding has evolved.
> > [...]
> 
> Hello, Will,
> 
> The intent is to replace commit 3f305018dcf3 ("docs/memory-barriers.txt:
> Enforce heavy ordering for port I/O accesses"), correct?  Either way is
> fine, just guessing based on the conflicts when applying this one.  ;-)

Yup, I decided to abandon the old patch:

http://lkml.kernel.org/r/20190211153043.GC32385@fuggles.cambridge.arm.com

Thanks,

Will
Paul E. McKenney Feb. 12, 2019, 7:24 p.m. UTC | #6
On Tue, Feb 12, 2019 at 06:43:54PM +0000, Will Deacon wrote:
> On Mon, Feb 11, 2019 at 12:22:18PM -0800, Paul E. McKenney wrote:
> > On Mon, Feb 11, 2019 at 05:29:48PM +0000, Will Deacon wrote:
> > > [...]
> > 
> > Hello, Will,
> > 
> > The intent is to replace commit 3f305018dcf3 ("docs/memory-barriers.txt:
> > Enforce heavy ordering for port I/O accesses"), correct?  Either way is
> > fine, just guessing based on the conflicts when applying this one.  ;-)
> 
> Yup, I decided to abandon the old patch:
> 
> http://lkml.kernel.org/r/20190211153043.GC32385@fuggles.cambridge.arm.com

Got it, and thank you for the reminder!

							Thanx, Paul
Will Deacon Feb. 13, 2019, 5:20 p.m. UTC | #7
[+Tony]

On Mon, Feb 11, 2019 at 02:34:31PM -0800, Linus Torvalds wrote:
> On Mon, Feb 11, 2019 at 9:30 AM Will Deacon <will.deacon@arm.com> wrote:
> >
> > +
> > +     1. All readX() and writeX() accesses to the same peripheral are ordered
> > +        with respect to each other. For example, this ensures that MMIO register
> > +       writes by the CPU to a particular device will arrive in program order.
> 
> Hmm. I'd like more people look at strengthening this one wrt across
> CPUs and locking.
> 
> Right now we document mmiowb(), but that "documentation" is really
> just a fairy tale. Very *very* few drivers actually do mmiowb() on
> their own.
> 
> IOW, we should seriously just consider making the rule be that locking
> will order mmio too. Because that's practically the rule anyway.

I would /love/ to get rid of mmiowb() because I think it's both extremely
difficult to use and also pretty much never needed. It reminds me a lot of
smp_read_barrier_depends(), which we finally pushed into READ_ONCE for
Alpha.

> Powerpc already does it. IO within a locked region will serialize with the
> lock.

I thought ia64 was the hold out here? Did they actually have machines that
needed this in practice? If so, I think we can either:

  (a) Add an mmiowb() to their spin_unlock() code, or
  (b) Remove ia64 altogether if nobody complains

I know that Peter has been in favour of (b) for a while...

Will
Linus Torvalds Feb. 13, 2019, 6:27 p.m. UTC | #8
On Wed, Feb 13, 2019 at 9:20 AM Will Deacon <will.deacon@arm.com> wrote:
>
> On Mon, Feb 11, 2019 at 02:34:31PM -0800, Linus Torvalds wrote:
> >
> > IOW, we should seriously just consider making the rule be that locking
> > will order mmio too. Because that's practically the rule anyway.
>
> I would /love/ to get rid of mmiowb() because I think it's both extremely
> difficult to use and also pretty much never needed. It reminds me a lot of
> smp_read_barrier_depends(), which we finally pushed into READ_ONCE for
> Alpha.

Right. Make as much of this implicit as we can.

At least as long as it's _reasonably_ cheap on all architectures that matter.

How expensive would it be on ARM? Does the normal acquire/release
already mean the IO in between is serialized?

> > Powerpc already does it. IO within a locked region will serialize with the
> > lock.
>
> I thought ia64 was the hold out here? Did they actually have machines that
> needed this in practice?

Note that even if mmiowb() is expensive (and I don't think that's
actually even the case on ia64), you can - and probably should - do
what PowerPC does.

Doing an IO barrier on PowerPC is insanely expensive, but they solve
that by simply tracking the whole "have I done any IO" state manually. It's
not even that expensive, it just uses a percpu flag.

(Admittedly, PowerPC makes it less obvious that it's a percpu variable
because it's actually in the special "paca" region that is like a
hyper-local percpu area).

> If so, I think we can either:
>
>   (a) Add an mmiowb() to their spin_unlock() code, or
>   (b) Remove ia64 altogether if nobody complains
>
> I know that Peter has been in favour of (b) for a while...

I don't think we're quite ready for (b), but see above: I don't think
adding mmiowb() to the ia64 spin unlock code is even all that
expensive.

Yeah, yeah, there's the SGI "SN" platform that apparently has a bug,
so because of that platform problem maybe it needs the more complex
"use a flag" model.  But even the complex model isn't _hugely_
complex.

But we *could* first just do the mmiowb() unconditionally in the ia64
unlocking code, and then see if anybody notices?

Tony, comments? Are there any SGI SN machines around any more?

                      Linus
Peter Zijlstra Feb. 13, 2019, 6:33 p.m. UTC | #9
On Wed, Feb 13, 2019 at 10:27:09AM -0800, Linus Torvalds wrote:
> Yeah, yeah, there's the SGI "SN" platform that apparently has a bug,
> so because of that platform problem maybe it needs the more complex
> "use a flag" model.  But even the complex model isn't _hugely_
> complex.
> 
> But we *could* first just do the mmiowb() unconditionally in the ia64
> unlocking code, and then see if anybody notices?
> 
> Tony, comments? Are there any SGI SN machines around any more?

I think the last time this came up, it was said that those people still
running Linux on Itanium were running old distro kernels, not upstream.

So yeah, we could probably do whatever and nobody would ever notice,
except maybe Al, who is rumoured to still have an ia64 :-)
Luck, Tony Feb. 13, 2019, 6:43 p.m. UTC | #10
> I think the last time this came up, it was said that those people still
> running Linux on Itanium were running old distro kernels, not upstream.
>
> So yeah, we could probably do whatever and nobody would ever notice,
> except maybe Al, who is rumoured to still have an ia64 :-)

I haven't heard of anyone taking upstream kernels and actually using them
for production work in a long time. It's mostly just a few folks keeping ia64
alive "just because" these days.  I doubt any of them have an SGI Altix
to test on (so realistically Altix was probably broken upstream many releases
ago).

-Tony
Paul E. McKenney Feb. 13, 2019, 7:31 p.m. UTC | #11
On Wed, Feb 13, 2019 at 06:43:41PM +0000, Luck, Tony wrote:
> > I think the last time this came up, it was said that those people still
> > running Linux on Itanium were running old distro kernels, not upstream.
> >
> > So yeah, we could probably do whatever and nobody would ever notice,
> > except maybe Al, who is rumoured to still have an ia64 :-)
> 
> I haven't heard of anyone taking upstream kernels and actually using them
> for production work in a long time. It's mostly just a few folks keeping ia64
> alive "just because" these days.  I doubt any of them have an SGI Altix
> to test on (so realistically Altix was probably broken upstream many releases
> ago).

That would require a well-heeled computer collector, wouldn't it?  Just to
run the thing.  ;-)

								Thanx, Paul
Will Deacon Feb. 18, 2019, 4:29 p.m. UTC | #12
Hi Arnd,

On Tue, Feb 12, 2019 at 02:03:04PM +0100, Arnd Bergmann wrote:
> On Mon, Feb 11, 2019 at 6:29 PM Will Deacon <will.deacon@arm.com> wrote:
> 
> > +     __iomem pointers obtained with non-default attributes (e.g. those returned
> > +     by ioremap_wc()) are unlikely to provide many of these guarantees. If
> > +     ordering is required for such mappings, then the mandatory barriers should
> > +     be used in conjunction with the _relaxed() accessors defined below.
> 
> > I wonder if we are even able to guarantee consistent behavior across
> > architectures in the last case here (wc mapping with relaxed accessors
> > and barriers).
> 
> Fortunately, there are only five implementations that actually differ from
> ioremap_nocache(): arm32, arm64, ppc32, ppc64 and x86 (both 32/64), so
> that is something we can probably figure out between the people on Cc.

I'd be keen to try to document this as a follow-up patch, otherwise there's
the risk of biting off more than I can chew, which is easily done with this
stuff! For arm32 (v7) and arm64, ioremap_wc() returns "normal, non-cacheable
memory". Some notable differences between this and the memory type returned
by ioremap() are:

 * ioremap_wc() allows speculative reads
 * ioremap_wc() allows unaligned access
 * ioremap_wc() allows reordering of accesses to different addresses
 * ioremap_wc() allows merging of accesses

so for us, you really only want to use it to map things that look an awful
lot like memory.

> The problem with recommending *_relaxed() + barrier() is that it ends
> up being more expensive than the non-relaxed accessors whenever
> _relaxed() implies the barrier already (true on most architectures other
> than arm).
> 
> ioremap_wc() in turn is used almost exclusively to map RAM behind
> a bus, (typically for frame buffers) and we may be better off not
> assuming any particular MMIO barrier semantics for it at all, but possibly
> audit the few uses that are not frame buffers.

Right, my expectation is actually that you very rarely need ordering
guarantees for wc mappings, and so saying "relaxed + mandatory barriers"
is the best thing to say for portable driver code. I'm deliberately /not/
trying to enumerate arch or device-specific behaviours.

> > +     Since many CPU architectures ultimately access these peripherals via an
> > +     internal virtual memory mapping, the portable ordering guarantees provided
> > +     by inX() and outX() are the same as those provided by readX() and writeX()
> > +     respectively when accessing a mapping with the default I/O attributes.
> 
> This is notably weaker than the PCI mandated non-posted write semantics.
> As I said earlier, not all architectures or PCI host implementations can provide
> non-posted writes though, but maybe we can document that fact here, e.g.
> 
> | Device drivers may expect outX() to be a non-posted write, i.e. waiting for
> | a completion response from the I/O device, which may not be possible
> | on particular hardware.

I can add something along these lines, since this seems like it could be a
bit of a "gotcha" given the macro names and implicit x86 heritage.

> >   (*) ioreadX(), iowriteX()
> >
> >       These will perform appropriately for the type of access they're actually
> >       doing, be it inX()/outX() or readX()/writeX().
> 
> This probably needs clarification as well then: On architectures that
> have a stronger barrier after outX() than writeX() but that use memory
> mapped access for both, the statement is currently not true. We could
> either strengthen the requirement by requiring CONFIG_GENERIC_IOMAP
> on such architectures, or we could document the current behavior
> as intentional and explicitly not allow iowriteX() on I/O ports to be posted.

At least on arm and arm64, the heavy barrier in outX() is *before* the I/O
access, and so it does nothing to prevent the access from being posted. It
looks like the asm-generic/io.h behaviour is the same in the case that none
of the __io_* barriers are provided by the architecture.

Do you think this is something we actually need to strengthen, or are
drivers that rely on this going to be x86-specific anyway?

> > +All of these accessors assume that the underlying peripheral is little-endian,
> > +and will therefore perform byte-swapping operations on big-endian architectures.
> 
> This sounds like a useful addition and the only sane way to do it IMHO, but
> I think at least traditionally we've had architectures that do not work like
> this but that make readX()/writeX() do native big-endian loads and stores, with
> a hardware byteswap on the PCI bus.

Sure, hence my disclaimer at the beginning about non-portable drivers :)
My goal here is really to document the portable semantics for the common
architectures, so that driver developers and reviewers can get the usual
case right.

Will
Will Deacon Feb. 18, 2019, 4:50 p.m. UTC | #13
On Wed, Feb 13, 2019 at 10:27:09AM -0800, Linus Torvalds wrote:
> On Wed, Feb 13, 2019 at 9:20 AM Will Deacon <will.deacon@arm.com> wrote:
> > On Mon, Feb 11, 2019 at 02:34:31PM -0800, Linus Torvalds wrote:
> > > IOW, we should seriously just consider making the rule be that locking
> > > will order mmio too. Because that's practically the rule anyway.
> >
> > I would /love/ to get rid of mmiowb() because I think it's both extremely
> > difficult to use and also pretty much never needed. It reminds me a lot of
> > smp_read_barrier_depends(), which we finally pushed into READ_ONCE for
> > Alpha.
> 
> Right. Make as much of this implicit as we can.
> 
> At least as long as it's _reasonably_ cheap on all architectures that matter.
> 
> How expensive would it be on ARM? Does the normal acquire/release
> already mean the IO in between is serialized?

Yeah, that should work, but actually I'm wondering whether that means we can
relax our mandatory barriers as well now that we have a multi-copy atomic
memory model (i.e. if acquire/release works for locks between CPUs, then it
should work for DMA buffers between a CPU and a device). I'll chase this up
with the architects...

Either way, mmiowb() is an empty macro for us.

> > > Powerpc already does it. IO within a locked region will serialize with the
> > > lock.
> >
> > I thought ia64 was the hold out here? Did they actually have machines that
> > needed this in practice?
> 
> Note that even if mmiowb() is expensive (and I don't think that's
> actually even the case on ia64), you can - and probably should - do
> what PowerPC does.
> 
> Doing an IO barrier on PowerPC is insanely expensive, but they solve
> that by simply tracking the whole "have I done any IO" state manually. It's
> not even that expensive, it just uses a percpu flag.
> 
> (Admittedly, PowerPC makes it less obvious that it's a percpu variable
> because it's actually in the special "paca" region that is like a
> hyper-local percpu area).
> 
> > If so, I think we can either:
> >
> >   (a) Add an mmiowb() to their spin_unlock() code, or
> >   (b) Remove ia64 altogether if nobody complains
> >
> > I know that Peter has been in favour of (b) for a while...
> 
> I don't think we're quite ready for (b), but see above: I don't think
> adding mmiowb() to the ia64 spin unlock code is even all that
> expensive.

Well, I figured it was worth asking the question.

> Yeah, yeah, there's the SGI "SN" platform that apparently has a bug,
> so because of that platform problem maybe it needs the more complex
> "use a flag" model.  But even the complex model isn't _hugely_
> complex.
> 
> But we *could* first just do the mmiowb() unconditionally in the ia64
> unlocking code, and then see if anybody notices?

I'll hack this up as a starting point. We can always try to be clever later
on if it's deemed necessary.
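As a rough user-space model of the "use a flag" trick quoted above (all names invented for illustration; the real ppc64 kernel keeps the flag in the paca rather than an ordinary percpu variable): each I/O access just sets a per-CPU flag, and the unlock path pays for the heavy barrier only when I/O actually happened.

```c
#include <assert.h>

/* Toy single-CPU model of the PowerPC scheme: io_sync stands in for the
 * percpu "have I done any IO" flag, barriers_issued counts how many
 * heavyweight barriers (mmiowb()/sync) the unlock path had to emit. */
static int io_sync;
static int barriers_issued;

static void model_writel(void)
{
	/* the MMIO store itself would go here */
	io_sync = 1;		/* remember that I/O happened */
}

static void model_spin_unlock(void)
{
	if (io_sync) {
		barriers_issued++;	/* stands in for mmiowb()/sync */
		io_sync = 0;
	}
	/* the actual lock release would go here */
}
```

A lock/unlock pair with no I/O in between costs nothing extra; only critical sections that actually touched the device pay for the barrier.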

Will
Arnd Bergmann Feb. 18, 2019, 4:59 p.m. UTC | #14
On Mon, Feb 18, 2019 at 5:30 PM Will Deacon <will.deacon@arm.com> wrote:
>
> On Tue, Feb 12, 2019 at 02:03:04PM +0100, Arnd Bergmann wrote:
> > On Mon, Feb 11, 2019 at 6:29 PM Will Deacon <will.deacon@arm.com> wrote:
> >
> > > +     __iomem pointers obtained with non-default attributes (e.g. those returned
> > > +     by ioremap_wc()) are unlikely to provide many of these guarantees. If
> > > +     ordering is required for such mappings, then the mandatory barriers should
> > > +     be used in conjunction with the _relaxed() accessors defined below.
> >
> > I wonder if we are even able to guarantee consistent behavior across
> > architectures
> > in the last case here (wc mapping with relaxed accessors and barriers).
> >
> > Fortunately, there are only five implementations that actually differ from
> > ioremap_nocache(): arm32, arm64, ppc32, ppc64 and x86 (both 32/64), so
> > that is something we can probably figure out between the people on Cc.
>
...
> > The problem with recommending *_relaxed() + barrier() is that it ends
> > up being more expensive than the non-relaxed accessors whenever
> > _relaxed() implies the barrier already (true on most architectures other
> > than arm).
> >
> > ioremap_wc() in turn is used almost exclusively to map RAM behind
> > a bus, (typically for frame buffers) and we may be better off not
> > assuming any particular MMIO barrier semantics for it at all, but possibly
> > audit the few uses that are not frame buffers.
>
> Right, my expectation is actually that you very rarely need ordering
> guarantees for wc mappings, and so saying "relaxed + mandatory barriers"
> is the best thing to say for portable driver code. I'm deliberately /not/
> trying to enumerate arch or device-specific behaviours.

That's fine, my worry is more that you are already saying too much
by describing a behavior for ioremap_wc+relaxed+barrier that is
neither a good idea nor guaranteed to do what you describe.

> > > +     Since many CPU architectures ultimately access these peripherals via an
> > > +     internal virtual memory mapping, the portable ordering guarantees provided
> > > +     by inX() and outX() are the same as those provided by readX() and writeX()
> > > +     respectively when accessing a mapping with the default I/O attributes.
> >
> > This is notably weaker than the PCI mandated non-posted write semantics.
> > As I said earlier, not all architectures or PCI host implementations can provide
> > non-posted writes though, but maybe we can document that fact here, e.g.
> >
> > | Device drivers may expect outX() to be a non-posted write, i.e. waiting for
> > | a completion response from the I/O device, which may not be possible
> > | on particular hardware.
>
> I can add something along these lines, since this seems like it could be a
> bit of a "gotcha" given the macro names and implicit x86 heritage.

Ok.

> > >   (*) ioreadX(), iowriteX()
> > >
> > >       These will perform appropriately for the type of access they're actually
> > >       doing, be it inX()/outX() or readX()/writeX().
> >
> > This probably needs clarification as well then: On architectures that
> > have a stronger barrier after outX() than writeX() but that use memory
> > mapped access for both, the statement is currently not true. We could
> > either strengthen the requirement by requiring CONFIG_GENERIC_IOMAP
> > on such architectures, or we could document the current behavior
> > as intentional and explicitly not allow iowriteX() on I/O ports to be posted.
>
> At least on arm and arm64, the heavy barrier in outX() is *before* the I/O
> access, and so it does nothing to prevent the access from being posted. It
> looks like the asm-generic/io.h behaviour is the same in the case that none
> of the __io_* barriers are provided by the architecture.
>
> Do you think this is something we actually need to strengthen, or are
> drivers that rely on this going to be x86-specific anyway?

I would say we should strengthen the behavior of outX() where possible.
I don't know if arm64 actually has a way of doing that, my understanding
earlier was that the AXI bus was already posted, so there is not much
you can do here to define __io_paw() in a way that will prevent posted
writes.

> > > +All of these accessors assume that the underlying peripheral is little-endian,
> > > +and will therefore perform byte-swapping operations on big-endian architectures.
> >
> > This sounds like a useful addition and the only sane way to do it IMHO, but
> > I think at least traditionally we've had architectures that do not work like
> > this but that make readX()/writeX() do native big-endian loads and stores, with
> > a hardware byteswap on the PCI bus.
>
> Sure, hence my disclaimer at the beginning about non-portable drivers :)
> My goal here is really to document the portable semantics for the common
> architectures, so that driver developers and reviewers can get the usual
> case right.

Fair enough.

      Arnd
Will Deacon Feb. 18, 2019, 5:56 p.m. UTC | #15
On Mon, Feb 18, 2019 at 05:59:13PM +0100, Arnd Bergmann wrote:
> On Mon, Feb 18, 2019 at 5:30 PM Will Deacon <will.deacon@arm.com> wrote:
> >
> > On Tue, Feb 12, 2019 at 02:03:04PM +0100, Arnd Bergmann wrote:
> > > On Mon, Feb 11, 2019 at 6:29 PM Will Deacon <will.deacon@arm.com> wrote:
> > >
> > > > +     __iomem pointers obtained with non-default attributes (e.g. those returned
> > > > +     by ioremap_wc()) are unlikely to provide many of these guarantees. If
> > > > +     ordering is required for such mappings, then the mandatory barriers should
> > > > +     be used in conjunction with the _relaxed() accessors defined below.
> > >
> > > I wonder if we are even able to guarantee consistent behavior across
> > > architectures
> > > in the last case here (wc mapping with relaxed accessors and barriers).
> > >
> > > Fortunately, there are only five implementations that actually differ from
> > > ioremap_nocache(): arm32, arm64, ppc32, ppc64 and x86 (both 32/64), so
> > > that is something we can probably figure out between the people on Cc.
> >
> ...
> > > The problem with recommending *_relaxed() + barrier() is that it ends
> > > up being more expensive than the non-relaxed accessors whenever
> > > _relaxed() implies the barrier already (true on most architectures other
> > > than arm).
> > >
> > > ioremap_wc() in turn is used almost exclusively to map RAM behind
> > > a bus, (typically for frame buffers) and we may be better off not
> > > assuming any particular MMIO barrier semantics for it at all, but possibly
> > > audit the few uses that are not frame buffers.
> >
> > Right, my expectation is actually that you very rarely need ordering
> > guarantees for wc mappings, and so saying "relaxed + mandatory barriers"
> > is the best thing to say for portable driver code. I'm deliberately /not/
> > trying to enumerate arch or device-specific behaviours.
> 
> That's fine, my worry is more that you are already saying too much
> by describing a behavior for ioremap_wc+relaxed+barrier that is
> neither a good idea nor guaranteed to do what you describe.

I could drop the mention of relaxed accessors here for now, if you like?
For example:

  "__iomem pointers obtained with non-default attributes (e.g. those returned
   by ioremap_wc()) are unlikely to provide many of these guarantees. If
   ordering is required for such mappings, then the mandatory barriers should
   be used."

which we could flesh out if/when we have a notion of the portable semantics.

> > > >   (*) ioreadX(), iowriteX()
> > > >
> > > >       These will perform appropriately for the type of access they're actually
> > > >       doing, be it inX()/outX() or readX()/writeX().
> > >
> > > This probably needs clarification as well then: On architectures that
> > > have a stronger barrier after outX() than writeX() but that use memory
> > > mapped access for both, the statement is currently not true. We could
> > > either strengthen the requirement by requiring CONFIG_GENERIC_IOMAP
> > > on such architectures, or we could document the current behavior
> > > as intentional and explicitly not allow iowriteX() on I/O ports to be posted.
> >
> > At least on arm and arm64, the heavy barrier in outX() is *before* the I/O
> > access, and so it does nothing to prevent the access from being posted. It
> > looks like the asm-generic/io.h behaviour is the same in the case that none
> > of the __io_* barriers are provided by the architecture.
> >
> > Do you think this is something we actually need to strengthen, or are
> > drivers that rely on this going to be x86-specific anyway?
> 
> I would say we should strengthen the behavior of outX() where possible.
> I don't know if arm64 actually has a way of doing that, my understanding
> earlier was that the AXI bus was already posted, so there is not much
> you can do here to define __io_paw() in a way that will prevent posted
> writes.

If we could map I/O space using different page table attributes (probably by
hacking pci_remap_iospace() ?) then we could disable the
early-write-acknowledge hint and implement __io_paw() as a completion
barrier, although it would be at the mercy of the system as to whether or
not that requires a response from the RC.

I would still prefer to document the weaker semantics as the portable
interface, unless there are portable drivers relying on this today (which
would imply that it's widely supported by other architectures).

Will
Arnd Bergmann Feb. 18, 2019, 8:37 p.m. UTC | #16
On Mon, Feb 18, 2019 at 6:56 PM Will Deacon <will.deacon@arm.com> wrote:
>
> On Mon, Feb 18, 2019 at 05:59:13PM +0100, Arnd Bergmann wrote:
> > On Mon, Feb 18, 2019 at 5:30 PM Will Deacon <will.deacon@arm.com> wrote:
> > >
> > > >
> > > > ioremap_wc() in turn is used almost exclusively to map RAM behind
> > > > a bus, (typically for frame buffers) and we may be better off not
> > > > assuming any particular MMIO barrier semantics for it at all, but possibly
> > > > audit the few uses that are not frame buffers.
> > >
> > > Right, my expectation is actually that you very rarely need ordering
> > > guarantees for wc mappings, and so saying "relaxed + mandatory barriers"
> > > is the best thing to say for portable driver code. I'm deliberately /not/
> > > trying to enumerate arch or device-specific behaviours.
> >
> > That's fine, my worry is more that you are already saying too much
> > by describing a behavior for ioremap_wc+relaxed+barrier that is
> > neither a good idea nor guaranteed to do what you describe.
>
> I could drop the mention of relaxed accessors here for now, if you like?
> For example:
>
>   "__iomem pointers obtained with non-default attributes (e.g. those returned
>    by ioremap_wc()) are unlikely to provide many of these guarantees. If
>    ordering is required for such mappings, then the mandatory barriers should
>    be used."
>
> which we could flesh out if/when we have a notion of the portable semantics.

I'd go further then and drop the second sentence entirely until we are sure
what portable behaviour would be.

> >
> > I would say we should strengthen the behavior of outX() where possible.
> > I don't know if arm64 actually has a way of doing that, my understanding
> > earlier was that the AXI bus was already posted, so there is not much
> > you can do here to define __io_paw() in a way that will prevent posted
> > writes.
>
> If we could map I/O space using different page table attributes (probably by
> hacking pci_remap_iospace() ?) then we could disable the
> early-write-acknowledge hint and implement __io_paw() as a completion
> barrier, although it would be at the mercy of the system as to whether or
> not that requires a response from the RC.

Ah, it seems we actually do that on 32-bit ARM, at least on one platform,
see 6a02734d420f ("ARM: mvebu: map PCI I/O regions strongly ordered")
and prior commits.

> I would still prefer to document the weaker semantics as the portable
> interface, unless there are portable drivers relying on this today (which
> would imply that it's widely supported by other architectures).

I don't know of any portable driver that actually relies on it, but
that's mainly because there are very few portable drivers that
use inb()/outb() in the first place. How many of those require
the non-posted behavior, I don't know.

Adding Thomas, Gregory and Russell to Cc, as they were involved
in the discussion that led to the 32-bit change, maybe they are
aware of a specific example.

       Arnd
Thomas Petazzoni Feb. 19, 2019, 10:27 a.m. UTC | #17
Hello,

On Mon, 18 Feb 2019 21:37:25 +0100
Arnd Bergmann <arnd@arndb.de> wrote:

> > > I would say we should strengthen the behavior of outX() where possible.
> > > I don't know if arm64 actually has a way of doing that, my understanding
> > > earlier was that the AXI bus was already posted, so there is not much
> > > you can do here to define __io_paw() in a way that will prevent posted
> > > writes.  
> >
> > If we could map I/O space using different page table attributes (probably by
> > hacking pci_remap_iospace() ?) then we could disable the
> > early-write-acknowledge hint and implement __io_paw() as a completion
> > barrier, although it would be at the mercy of the system as to whether or
> > not that requires a response from the RC.  
> 
> Ah, it seems we actually do that on 32-bit ARM, at least on one platform,
> see 6a02734d420f ("ARM: mvebu: map PCI I/O regions strongly ordered")
> and prior commits.

Yes, some Marvell Armada 32-bit platforms have an erratum that requires
the PCI MEM and PCI I/O regions to be mapped strongly ordered.

BTW, this requirement prevents us from using the pci_remap_iospace()
API from drivers/pci, because it assumes page attributes of
pgprot_device(PAGE_KERNEL). That's why we're still using the
ARM-specific pci_ioremap_io() function.

> > I would still prefer to document the weaker semantics as the portable
> > interface, unless there are portable drivers relying on this today (which
> > would imply that it's widely supported by other architectures).  
> 
> I don't know of any portable driver that actually relies on it, but
> that's mainly because there are very few portable drivers that
> use inb()/outb() in the first place. How many of those require
> the non-posted behavior I don't know
> 
> Adding Thomas, Gregory and Russell to Cc, as they were involved
> in the discussion that led to the 32-bit change, maybe they are
> aware of a specific example.

I'm just arriving in the middle of this thread, and I'm not sure I
understand what the question is. If the question is whether PCI I/O is
really used in practice, then I've never seen it be used with Marvell
platforms (but I'm also not aware of all PCIe devices people are
using). I personally have a hacked-up version of the e1000e driver
that intentionally does some PCI I/O accesses, that I use as a way to
validate that PCI I/O support is minimally working, but that's it.

Best regards,

Thomas
Arnd Bergmann Feb. 19, 2019, 11:31 a.m. UTC | #18
On Tue, Feb 19, 2019 at 11:27 AM Thomas Petazzoni
<thomas.petazzoni@bootlin.com> wrote:
> On Mon, 18 Feb 2019 21:37:25 +0100
> Arnd Bergmann <arnd@arndb.de> wrote:
>
> > Ah, it seems we actually do that on 32-bit ARM, at least on one platform,
> > see 6a02734d420f ("ARM: mvebu: map PCI I/O regions strongly ordered")
> > and prior commits.
>
> Yes, some Marvell Armada 32-bit platforms have an errata that require
> the PCI MEM and PCI I/O regions to be mapped strongly ordered.
>
> BTW, this requirement prevents us from using the pci_remap_iospace()
> API from drivers/pci, because it assumes page attributes of
> pgprot_device(PAGE_KERNEL). That's why we're still using the
> ARM-specific pci_ioremap_io() function.

Ok.

> > > I would still prefer to document the weaker semantics as the portable
> > > interface, unless there are portable drivers relying on this today (which
> > > would imply that it's widely supported by other architectures).
> >
> > I don't know of any portable driver that actually relies on it, but
> > that's mainly because there are very few portable drivers that
> > use inb()/outb() in the first place. How many of those require
> > the non-posted behavior I don't know
> >
> > Adding Thomas, Gregory and Russell to Cc, as they were involved
> > in the discussion that led to the 32-bit change, maybe they are
> > aware of a specific example.
>
> I'm just arriving in the middle of this thread, and I'm not sure to
> understand what is the question. If the question is whether PCI I/O is
> really used in practice, then I've never seen it be used with Marvell
> platforms (but I'm also not aware of all PCIe devices people are
> using). I personally have a hacked-up version of the e1000e driver
> that intentionally does some PCI I/O accesses, that I use as a way to
> validate that PCI I/O support is minimally working, but that's it.

The main question is whether we know of a portable (not specific
to a single architecture) driver for PCIe (or PCI, but probably not ISA)
that uses outb/outw/outl in a way that relies on the non-posted
behavior of those, compared to posted writeb/writew/writel that are
only required to complete before another I/O operation on the same
device but whose completion is not ordered with regard to the rest of
the system.

I think an example of this would be a driver using outb() to disable
an interrupt, and then relying on the interrupt no longer happening
after the outb().
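As a toy illustration of that reliance (all names invented; this just models a write buffer, not any real bus): if the disabling outb() is posted, the device can still raise an interrupt after the CPU has "done" the write but before the device actually sees it.

```c
#include <assert.h>

/* Toy model: a posted write to an interrupt-disable register sits in a
 * bridge buffer, so the device keeps raising interrupts until the
 * buffer drains. Purely illustrative; every name here is made up. */
static int disable_posted;	/* disable write buffered in the bridge */
static int irq_enabled = 1;	/* device-side interrupt enable bit */
static int irqs_raised;

static void outb_disable_irq(void)
{
	disable_posted = 1;	/* CPU continues; device not updated yet */
}

static void bridge_drain(void)	/* e.g. forced by a later read-back */
{
	if (disable_posted) {
		irq_enabled = 0;
		disable_posted = 0;
	}
}

static void device_event(void)
{
	if (irq_enabled)
		irqs_raised++;	/* interrupt still fires */
}
```

With a non-posted outb() the disable would take effect before the call returned; with the posted model above there is a window where interrupts still arrive.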

     Arnd
Will Deacon Feb. 19, 2019, 11:34 a.m. UTC | #19
Hi Thomas,

On Tue, Feb 19, 2019 at 11:27:47AM +0100, Thomas Petazzoni wrote:
> On Mon, 18 Feb 2019 21:37:25 +0100
> Arnd Bergmann <arnd@arndb.de> wrote:
> 
> > > > I would say we should strengthen the behavior of outX() where possible.
> > > > I don't know if arm64 actually has a way of doing that, my understanding
> > > > earlier was that the AXI bus was already posted, so there is not much
> > > > you can do here to define __io_paw() in a way that will prevent posted
> > > > writes.  
> > >
> > > If we could map I/O space using different page table attributes (probably by
> > > hacking pci_remap_iospace() ?) then we could disable the
> > > early-write-acknowledge hint and implement __io_paw() as a completion
> > > barrier, although it would be at the mercy of the system as to whether or
> > > not that requires a response from the RC.  
> > 
> > Ah, it seems we actually do that on 32-bit ARM, at least on one platform,
> > see 6a02734d420f ("ARM: mvebu: map PCI I/O regions strongly ordered")
> > and prior commits.
> 
> Yes, some Marvell Armada 32-bit platforms have an errata that require
> the PCI MEM and PCI I/O regions to be mapped strongly ordered.
> 
> BTW, this requirement prevents us from using the pci_remap_iospace()
> API from drivers/pci, because it assumes page attributes of
> pgprot_device(PAGE_KERNEL). That's why we're still using the
> ARM-specific pci_ioremap_io() function.

Ah, I think I vaguely remember that. It was to avoid a hardware deadlock,
right? In which case, I'd rather consider this use of strongly-ordered
memory an exceptional case as opposed to a general property of I/O mappings.

> > > I would still prefer to document the weaker semantics as the portable
> > > interface, unless there are portable drivers relying on this today (which
> > > would imply that it's widely supported by other architectures).  
> > 
> > I don't know of any portable driver that actually relies on it, but
> > that's mainly because there are very few portable drivers that
> > use inb()/outb() in the first place. How many of those require
> > the non-posted behavior I don't know
> > 
> > Adding Thomas, Gregory and Russell to Cc, as they were involved
> > in the discussion that led to the 32-bit change, maybe they are
> > aware of a specific example.
> 
> I'm just arriving in the middle of this thread, and I'm not sure to
> understand what is the question. If the question is whether PCI I/O is
> really used in practice, then I've never seen it be used with Marvell
> platforms (but I'm also not aware of all PCIe devices people are
> using). I personally have a hacked-up version of the e1000e driver
> that intentionally does some PCI I/O accesses, that I use as a way to
> validate that PCI I/O support is minimally working, but that's it.

It was actually even more subtle than that! The question is whether outX()
is relied upon to be non-posted in portable drivers, because at the moment
it's typically posted for arm/arm64 systems, with the exception of the Armada
erratum above.

Cheers,

Will
Will Deacon Feb. 19, 2019, 11:36 a.m. UTC | #20
On Tue, Feb 19, 2019 at 12:31:50PM +0100, Arnd Bergmann wrote:
> On Tue, Feb 19, 2019 at 11:27 AM Thomas Petazzoni
> <thomas.petazzoni@bootlin.com> wrote:
> > On Mon, 18 Feb 2019 21:37:25 +0100
> > Arnd Bergmann <arnd@arndb.de> wrote:
> >
> > > Ah, it seems we actually do that on 32-bit ARM, at least on one platform,
> > > see 6a02734d420f ("ARM: mvebu: map PCI I/O regions strongly ordered")
> > > and prior commits.
> >
> > Yes, some Marvell Armada 32-bit platforms have an errata that require
> > the PCI MEM and PCI I/O regions to be mapped strongly ordered.
> >
> > BTW, this requirement prevents us from using the pci_remap_iospace()
> > API from drivers/pci, because it assumes page attributes of
> > pgprot_device(PAGE_KERNEL). That's why we're still using the
> > ARM-specific pci_ioremap_io() function.
> 
> Ok.
> 
> > > > I would still prefer to document the weaker semantics as the portable
> > > > interface, unless there are portable drivers relying on this today (which
> > > > would imply that it's widely supported by other architectures).
> > >
> > > I don't know of any portable driver that actually relies on it, but
> > > that's mainly because there are very few portable drivers that
> > > use inb()/outb() in the first place. How many of those require
> > > the non-posted behavior I don't know
> > >
> > > Adding Thomas, Gregory and Russell to Cc, as they were involved
> > > in the discussion that led to the 32-bit change, maybe they are
> > > aware of a specific example.
> >
> > I'm just arriving in the middle of this thread, and I'm not sure to
> > understand what is the question. If the question is whether PCI I/O is
> > really used in practice, then I've never seen it be used with Marvell
> > platforms (but I'm also not aware of all PCIe devices people are
> > using). I personally have a hacked-up version of the e1000e driver
> > that intentionally does some PCI I/O accesses, that I use as a way to
> > validate that PCI I/O support is minimally working, but that's it.
> 
> The main question is whether we know of a portable (not specific
> to a single architecture) driver for PCIe (or PCI, but probably not ISA)
> that uses outb/outw/outl in a way that relies on the non-posted
> behavior of those, compared to posted writeb/writew/writel that are
> only required to complete before another I/O operation on the same
> device but whose completion is not ordered with regard to the rest of
> the system.
> 
> I think an example of this would be a driver using outb() to disable
> an interrupt, and then relying on the interrupt no longer happening
> after the outb().

Isn't that racy already? i.e. the interrupt could fire just before you
disabled it, but not get delivered by the irq controller until after you'd
disabled it at the device?

Will
Arnd Bergmann Feb. 19, 2019, 1:01 p.m. UTC | #21
On Tue, Feb 19, 2019 at 12:36 PM Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Feb 19, 2019 at 12:31:50PM +0100, Arnd Bergmann wrote:
> > On Tue, Feb 19, 2019 at 11:27 AM Thomas Petazzoni
> > <thomas.petazzoni@bootlin.com> wrote:
> >
> > I think an example of this would be a driver using outb() to disable
> > an interrupt, and then relying on the interrupt no longer happening
> > after the outb().
>
> Isn't that racy already? i.e. the interrupt could fire just before you
> disabled it, but not get delivered by the irq controller until after you'd
> disabled it at the device?

Probably, I had a hard enough time trying to come up with any example ;-)

One reference to non-posted transaction in a comment is in
drivers/net/ethernet/dec/tulip/de4x5.c:

/*
** The DE4X5 interrupt handler.
**
** I/O Read/Writes through intermediate PCI bridges are never 'posted',
** so that the asserted interrupt always has some real data to work with -
** if these I/O accesses are ever changed to memory accesses, ensure the
** STS write is read immediately to complete the transaction if the adapter
** is not on bus 0. Lost interrupts can still occur when the PCI bus load
** is high and descriptor status bits cannot be set before the associated
** interrupt is asserted and this routine entered.
*/

I found another comment in the via-rhine driver:

/* Beware of PCI posted writes */
#define IOSYNC  do { ioread8(ioaddr + StationAddr); } while (0)

this one is used in the chip reset function, and in the transmit
path.

       Arnd
Will Deacon Feb. 19, 2019, 1:20 p.m. UTC | #22
On Tue, Feb 19, 2019 at 02:01:50PM +0100, Arnd Bergmann wrote:
> On Tue, Feb 19, 2019 at 12:36 PM Will Deacon <will.deacon@arm.com> wrote:
> > On Tue, Feb 19, 2019 at 12:31:50PM +0100, Arnd Bergmann wrote:
> > > On Tue, Feb 19, 2019 at 11:27 AM Thomas Petazzoni
> > > <thomas.petazzoni@bootlin.com> wrote:
> > >
> > > I think an example of this would be a driver using outb() to disable
> > > an interrupt, and then relying on the interrupt no longer happening
> > > after the outb().
> >
> > Isn't that racy already? i.e. the interrupt could fire just before you
> > disabled it, but not get delivered by the irq controller until after you'd
> > disabled it at the device?
> 
> Probably, I had a hard enough time trying to come up with any example ;-)

You and me both!

> One reference to non-posted transaction in a comment is in
> drivers/net/ethernet/dec/tulip/de4x5.c:
> 
> /*
> ** The DE4X5 interrupt handler.
> **
> ** I/O Read/Writes through intermediate PCI bridges are never 'posted',
> ** so that the asserted interrupt always has some real data to work with -
> ** if these I/O accesses are ever changed to memory accesses, ensure the
> ** STS write is read immediately to complete the transaction if the adapter
> ** is not on bus 0. Lost interrupts can still occur when the PCI bus load
> ** is high and descriptor status bits cannot be set before the associated
> ** interrupt is asserted and this routine entered.
> */

Thankfully, that driver is both old and non-portable:

  depends on VIRT_TO_BUS || ALPHA || PPC || SPARC

so I'm not especially concerned about it. Judging by the comment, we'd
need to add something like:


diff --git a/drivers/net/ethernet/dec/tulip/de4x5.c b/drivers/net/ethernet/dec/tulip/de4x5.c
index 66535d1653f6..c85089f65b0e 100644
--- a/drivers/net/ethernet/dec/tulip/de4x5.c
+++ b/drivers/net/ethernet/dec/tulip/de4x5.c
@@ -1556,6 +1556,10 @@ de4x5_interrupt(int irq, void *dev_id)
 	sts = inl(DE4X5_STS);            /* Read IRQ status */
 	outl(sts, DE4X5_STS);            /* Reset the board interrupts */
 
+	/* Beware of PCI posted writes */
+	if (IS_ENABLED(CONFIG_ARM64) && lp->bus_num)
+		inl(DE4X5_STS);
+
 	if (!(sts & lp->irq_mask)) break;/* All done */
 	handled = 1;
 


if we wanted to get this working reliably on arm64. However, I'll be honest
and say we haven't had much demand for supporting DEC PCI devices yet :)

> I found another comment in the via-rhine driver:
> 
> /* Beware of PCI posted writes */
> #define IOSYNC  do { ioread8(ioaddr + StationAddr); } while (0)
> 
> this one is used in the chip reset function, and in the transmit
> path.

Since this is doing a read-back, I take this comment as saying "posted
writes are a thing, so perform a read-back to force the prior write to
complete", which is fine.
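The read-back idiom can be modelled in the same toy fashion (names invented for illustration): PCI ordering rules prevent a read from passing earlier writes to the same device, so completing the read forces any buffered write out.

```c
#include <assert.h>

/* Toy model of why a read-back flushes a posted MMIO write: the read
 * cannot complete until earlier posted writes to the device have been
 * delivered. Illustrative only; not modelled on any real bridge. */
static int write_posted;
static int pending_val;
static int device_reg;		/* value the device has actually seen */

static void posted_write(int val)
{
	pending_val = val;
	write_posted = 1;	/* buffered; device may not see it yet */
}

static int device_read(void)
{
	if (write_posted) {	/* read-back drains earlier writes first */
		device_reg = pending_val;
		write_posted = 0;
	}
	return device_reg;
}
```

This is exactly the guarantee the via-rhine IOSYNC macro is leaning on: after the ioread8() returns, the prior write is known to have reached the device.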

Will
Arnd Bergmann Feb. 19, 2019, 1:45 p.m. UTC | #23
On Tue, Feb 19, 2019 at 2:20 PM Will Deacon <will.deacon@arm.com> wrote:
> On Tue, Feb 19, 2019 at 02:01:50PM +0100, Arnd Bergmann wrote:
> > On Tue, Feb 19, 2019 at 12:36 PM Will Deacon <will.deacon@arm.com> wrote:
> > > On Tue, Feb 19, 2019 at 12:31:50PM +0100, Arnd Bergmann wrote:
> > > > On Tue, Feb 19, 2019 at 11:27 AM Thomas Petazzoni
> > One reference to non-posted transaction in a comment is in
> > drivers/net/ethernet/dec/tulip/de4x5.c:
> >
> > /*
> > ** The DE4X5 interrupt handler.
> > **
> > ** I/O Read/Writes through intermediate PCI bridges are never 'posted',
> > ** so that the asserted interrupt always has some real data to work with -
> > ** if these I/O accesses are ever changed to memory accesses, ensure the
> > ** STS write is read immediately to complete the transaction if the adapter
> > ** is not on bus 0. Lost interrupts can still occur when the PCI bus load
> > ** is high and descriptor status bits cannot be set before the associated
> > ** interrupt is asserted and this routine entered.
> > */
>
> Thankfully, that driver is both old and non-portable:
>
>   depends on VIRT_TO_BUS || ALPHA || PPC || SPARC
>
> so I'm not especially concerned about it. Judging by the comment, we'd
> need to add something like:
>
>
> diff --git a/drivers/net/ethernet/dec/tulip/de4x5.c b/drivers/net/ethernet/dec/tulip/de4x5.c
> index 66535d1653f6..c85089f65b0e 100644
> --- a/drivers/net/ethernet/dec/tulip/de4x5.c
> +++ b/drivers/net/ethernet/dec/tulip/de4x5.c
> @@ -1556,6 +1556,10 @@ de4x5_interrupt(int irq, void *dev_id)
>         sts = inl(DE4X5_STS);            /* Read IRQ status */
>         outl(sts, DE4X5_STS);            /* Reset the board interrupts */
>
> +       /* Beware of PCI posted writes */
> +       if (IS_ENABLED(CONFIG_ARM64) && lp->bus_num)
> +               inl(DE4X5_STS);
> +
>         if (!(sts & lp->irq_mask)) break;/* All done */
>         handled = 1;
>
>
>
> if we wanted to get this working reliably on arm64. However, I'll be honest
> and say we haven't had much demand for supporting DEC PCI devices yet :)

Agreed. I find a total of 590 drivers calling inb/outb and related
helpers, using "git grep -wl 'out\(l\|w\|b\)' drivers/ sound/" (there
are probably a couple more). There are hardly any among those
that one would expect to see in a modern machine:

- The majority is for ISA, PCMCIA or PCI devices, as opposed
  to PCIe, and presumably no longer sold.
- Many are de facto architecture-specific, e.g. drivers for
  stuff in a PC chipset.

drivers/staging/comedi/ has some things in it that have
a slight chance of still being put into modern machines
when moving from an old embedded system to a new one,
but being staging, chances are that the drivers also require
other changes before they work on a new architecture.

Overall, I wonder if we'd be better off completely disabling
outb etc on arm64, and only supporting iomap()/ioread()/iowrite()
instead, without giving any guarantees there.
Getting there would definitely require some code changes
(and many Kconfig changes), but it still seems appealing
given how much legacy stuff we could completely ignore
afterwards.

> > I found another comment in the via-rhine driver:
> >
> > /* Beware of PCI posted writes */
> > #define IOSYNC  do { ioread8(ioaddr + StationAddr); } while (0)
> >
> > this one is used in the chip reset function, and in the transmit
> > path.
>
> Since this is doing a read-back, I take this comment as saying "posted
> writes are a thing, so perform a read-back to force the prior write to
> complete", which is fine.

Right.

       Arnd
Will Deacon Feb. 19, 2019, 4:13 p.m. UTC | #24
[+more ppc folks]

On Mon, Feb 18, 2019 at 04:50:12PM +0000, Will Deacon wrote:
> On Wed, Feb 13, 2019 at 10:27:09AM -0800, Linus Torvalds wrote:
> > Note that even if mmiowb() is expensive (and I don't think that's
> > actually even the case on ia64), you can - and probably should - do
> > what PowerPC does.
> > 
> > Doing an IO barrier on PowerPC is insanely expensive, but they solve
> > that simply track the whole "have I done any IO" manually. It's not
> > even that expensive, it just uses a percpu flag.
> > 
> > (Admittedly, PowerPC makes it less obvious that it's a percpu variable
> > because it's actually in the special "paca" region that is like a
> > hyper-local percpu area).

[...]

> > But we *could* first just do the mmiowb() unconditionally in the ia64
> > unlocking code, and then see if anybody notices?
> 
> I'll hack this up as a starting point. We can always try to be clever later
> on if it's deemed necessary.

Ok, so I started hacking this up in core code with the percpu flag (since
riscv apparently needs it), but I've now realised that I don't understand
how the PowerPC trick works after all. Consider the following:

	spin_lock(&foo);	// io_sync = 0
	outb(42, port);		// io_sync = 1
	spin_lock(&bar);	// io_sync = 0
	...
	spin_unlock(&bar);
	spin_unlock(&foo);

The inner lock could even happen in an irq afaict, but we'll end up skipping
the mmiowb()/sync because the io_sync flag is unconditionally cleared by
spin_lock(). Fixing this is complicated by the fact that I/O writes can be
performed in preemptible context with no locks held, so we can end up
spuriously setting the io_sync flag for arbitrary CPUs, hence the desire
to clear it in spin_lock().

If the paca entry was more than a byte, we could probably track that a
spinlock is held and then avoid clearing the flag prematurely, but I have
a feeling that I'm missing something. Anybody know how this is supposed to
work?

Will
Michael Ellerman Feb. 21, 2019, 6:22 a.m. UTC | #25
Will Deacon <will.deacon@arm.com> writes:
> [+more ppc folks]
>
> On Mon, Feb 18, 2019 at 04:50:12PM +0000, Will Deacon wrote:
>> On Wed, Feb 13, 2019 at 10:27:09AM -0800, Linus Torvalds wrote:
>> > Note that even if mmiowb() is expensive (and I don't think that's
>> > actually even the case on ia64), you can - and probably should - do
>> > what PowerPC does.
>> > 
>> > Doing an IO barrier on PowerPC is insanely expensive, but they solve
>> > that simply track the whole "have I done any IO" manually. It's not
>> > even that expensive, it just uses a percpu flag.
>> > 
>> > (Admittedly, PowerPC makes it less obvious that it's a percpu variable
>> > because it's actually in the special "paca" region that is like a
>> > hyper-local percpu area).
>
> [...]
>
>> > But we *could* first just do the mmiowb() unconditionally in the ia64
>> > unlocking code, and then see if anybody notices?
>> 
>> I'll hack this up as a starting point. We can always try to be clever later
>> on if it's deemed necessary.
>
> Ok, so I started hacking this up in core code with the percpu flag (since
> riscv apparently needs it), but I've now realised that I don't understand
> how the PowerPC trick works after all. Consider the following:
>
> 	spin_lock(&foo);	// io_sync = 0
> 	outb(42, port);		// io_sync = 1
> 	spin_lock(&bar);	// io_sync = 0
> 	...
> 	spin_unlock(&bar);
> 	spin_unlock(&foo);
>
> The inner lock could even happen in an irq afaict, but we'll end up skipping
> the mmiowb()/sync because the io_sync flag is unconditionally cleared by
> spin_lock(). Fixing this is complicated by the fact that I/O writes can be
> performed in preemptible context with no locks held, so we can end up
> spuriously setting the io_sync flag for arbitrary CPUs, hence the desire
> to clear it in spin_lock().
>
> If the paca entry was more than a byte, we could probably track that a
> spinlock is held and then avoid clearing the flag prematurely, but I have
> a feeling that I'm missing something. Anybody know how this is supposed to
> work?

I don't think you're missing anything :/

Having two flags like you suggest could work. Or you could just make the
flag into a nesting counter.

Or do you just remove the clearing from spin_lock()? 

That gets you:

 	spin_lock(&foo);
 	outb(42, port);		// io_sync = 1
 	spin_lock(&bar);
 	...
 	spin_unlock(&bar);	// mb(); io_sync = 0
 	spin_unlock(&foo);


And I/O outside of the lock case:

 	outb(42, port);		// io_sync = 1

 	spin_lock(&bar);
 	...
 	spin_unlock(&bar);	// mb(); io_sync = 0


Extra barriers are not ideal, but the odd spurious mb() might be
preferable to doing another compare and branch or increment in every
spin_lock()?

cheers
Will Deacon Feb. 22, 2019, 5:38 p.m. UTC | #26
On Thu, Feb 21, 2019 at 05:22:03PM +1100, Michael Ellerman wrote:
> Will Deacon <will.deacon@arm.com> writes:
> > [+more ppc folks]
> >
> > On Mon, Feb 18, 2019 at 04:50:12PM +0000, Will Deacon wrote:
> >> On Wed, Feb 13, 2019 at 10:27:09AM -0800, Linus Torvalds wrote:
> >> > Note that even if mmiowb() is expensive (and I don't think that's
> >> > actually even the case on ia64), you can - and probably should - do
> >> > what PowerPC does.
> >> > 
> >> > Doing an IO barrier on PowerPC is insanely expensive, but they solve
> >> > that simply track the whole "have I done any IO" manually. It's not
> >> > even that expensive, it just uses a percpu flag.
> >> > 
> >> > (Admittedly, PowerPC makes it less obvious that it's a percpu variable
> >> > because it's actually in the special "paca" region that is like a
> >> > hyper-local percpu area).
> >
> > [...]
> >
> >> > But we *could* first just do the mmiowb() unconditionally in the ia64
> >> > unlocking code, and then see if anybody notices?
> >> 
> >> I'll hack this up as a starting point. We can always try to be clever later
> >> on if it's deemed necessary.
> >
> > Ok, so I started hacking this up in core code with the percpu flag (since
> > riscv apparently needs it), but I've now realised that I don't understand
> > how the PowerPC trick works after all. Consider the following:
> >
> > 	spin_lock(&foo);	// io_sync = 0
> > 	outb(42, port);		// io_sync = 1
> > 	spin_lock(&bar);	// io_sync = 0
> > 	...
> > 	spin_unlock(&bar);
> > 	spin_unlock(&foo);
> >
> > The inner lock could even happen in an irq afaict, but we'll end up skipping
> > the mmiowb()/sync because the io_sync flag is unconditionally cleared by
> > spin_lock(). Fixing this is complicated by the fact that I/O writes can be
> > performed in preemptible context with no locks held, so we can end up
> > spuriously setting the io_sync flag for arbitrary CPUs, hence the desire
> > to clear it in spin_lock().
> >
> > If the paca entry was more than a byte, we could probably track that a
> > spinlock is held and then avoid clearing the flag prematurely, but I have
> > a feeling that I'm missing something. Anybody know how this is supposed to
> > work?
> 
> I don't think you're missing anything :/

Ok, well that's slightly reassuring for me :)

> Having two flags like you suggest could work. Or you could just make the
> flag into a nesting counter.

My work-in-progress asm-generic version uses a counter, but I can't squeeze
that into your u8 paca entry. I'll cc you when I post the patches, so
perhaps you can hack up the ppc side.

> Or do you just remove the clearing from spin_lock()? 
> 
> That gets you:
> 
>  	spin_lock(&foo);
>  	outb(42, port);		// io_sync = 1
>  	spin_lock(&bar);
>  	...
>  	spin_unlock(&bar);	// mb(); io_sync = 0
>  	spin_unlock(&foo);
> 
> 
> And I/O outside of the lock case:
> 
>  	outb(42, port);		// io_sync = 1
> 
>  	spin_lock(&bar);
>  	...
>  	spin_unlock(&bar);	// mb(); io_sync = 0
> 
> 
> Extra barriers are not ideal, but the odd spurious mb() might be
> preferable to doing another compare and branch or increment in every
> spin_lock()?

Up to you. I'm working on the assumption that these barriers are insanely
expensive, otherwise we'd just upgrade spin_unlock() and work on things
that are more fun instead ;)

Will

Patch
diff mbox series

diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index 1c22b21ae922..d08b49b2c011 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -2599,72 +2599,81 @@  likely, then interrupt-disabling locks should be used to guarantee ordering.
 KERNEL I/O BARRIER EFFECTS
 ==========================
 
-When accessing I/O memory, drivers should use the appropriate accessor
-functions:
-
- (*) inX(), outX():
-
-     These are intended to talk to I/O space rather than memory space, but
-     that's primarily a CPU-specific concept.  The i386 and x86_64 processors
-     do indeed have special I/O space access cycles and instructions, but many
-     CPUs don't have such a concept.
-
-     The PCI bus, amongst others, defines an I/O space concept which - on such
-     CPUs as i386 and x86_64 - readily maps to the CPU's concept of I/O
-     space.  However, it may also be mapped as a virtual I/O space in the CPU's
-     memory map, particularly on those CPUs that don't support alternate I/O
-     spaces.
-
-     Accesses to this space may be fully synchronous (as on i386), but
-     intermediary bridges (such as the PCI host bridge) may not fully honour
-     that.
-
-     They are guaranteed to be fully ordered with respect to each other.
-
-     They are not guaranteed to be fully ordered with respect to other types of
-     memory and I/O operation.
+Interfacing with peripherals via I/O accesses is deeply architecture and device
+specific. Therefore, drivers which are inherently non-portable may rely on
+specific behaviours of their target systems in order to achieve synchronization
+in the most lightweight manner possible. For drivers intending to be portable
+between multiple architectures and bus implementations, the kernel offers a
+series of accessor functions that provide various degrees of ordering
+guarantees:
 
  (*) readX(), writeX():
 
-     Whether these are guaranteed to be fully ordered and uncombined with
-     respect to each other on the issuing CPU depends on the characteristics
-     defined for the memory window through which they're accessing.  On later
-     i386 architecture machines, for example, this is controlled by way of the
-     MTRR registers.
+     The readX() and writeX() MMIO accessors take a pointer to the peripheral
+     being accessed as an __iomem * parameter. For pointers mapped with the
+     default I/O attributes (e.g. those returned by ioremap()), then the
+     ordering guarantees are as follows:
+
+     1. All readX() and writeX() accesses to the same peripheral are ordered
+        with respect to each other. For example, this ensures that MMIO register
+	writes by the CPU to a particular device will arrive in program order.
+
+     2. A writeX() by the CPU to the peripheral will first wait for the
+        completion of all prior CPU writes to memory. For example, this ensures
+        that writes by the CPU to an outbound DMA buffer allocated by
+        dma_alloc_coherent() will be visible to a DMA engine when the CPU writes
+        to its MMIO control register to trigger the transfer.
+
+     3. A readX() by the CPU from the peripheral will complete before any
+	subsequent CPU reads from memory can begin. For example, this ensures
+	that reads by the CPU from an incoming DMA buffer allocated by
+	dma_alloc_coherent() will not see stale data after reading from the DMA
+	engine's MMIO status register to establish that the DMA transfer has
+	completed.
+
+     4. A readX() by the CPU from the peripheral will complete before any
+	subsequent delay() loop can begin execution. For example, this ensures
+	that two MMIO register writes by the CPU to a peripheral will arrive at
+	least 1us apart if the first write is immediately read back with readX()
+	and udelay(1) is called prior to the second writeX().
+
+     __iomem pointers obtained with non-default attributes (e.g. those returned
+     by ioremap_wc()) are unlikely to provide many of these guarantees. If
+     ordering is required for such mappings, then the mandatory barriers should
+     be used in conjunction with the _relaxed() accessors defined below.
+
+ (*) readX_relaxed(), writeX_relaxed():
 
-     Ordinarily, these will be guaranteed to be fully ordered and uncombined,
-     provided they're not accessing a prefetchable device.
-
-     However, intermediary hardware (such as a PCI bridge) may indulge in
-     deferral if it so wishes; to flush a store, a load from the same location
-     is preferred[*], but a load from the same device or from configuration
-     space should suffice for PCI.
-
-     [*] NOTE! attempting to load from the same location as was written to may
-	 cause a malfunction - consider the 16550 Rx/Tx serial registers for
-	 example.
-
-     Used with prefetchable I/O memory, an mmiowb() barrier may be required to
-     force stores to be ordered.
+     These are similar to readX() and writeX(), but provide weaker memory
+     ordering guarantees. Specifically, they do not guarantee ordering with
+     respect to normal memory accesses or delay() loops (i.e. bullets 2-4 above)
+     but they are still guaranteed to be ordered with respect to other accesses
+     to the same peripheral when operating on __iomem pointers mapped with the
+     default I/O attributes.
 
-     Please refer to the PCI specification for more information on interactions
-     between PCI transactions.
+ (*) inX(), outX():
 
- (*) readX_relaxed(), writeX_relaxed()
+     The inX() and outX() accessors are intended to access legacy port-mapped
+     I/O peripherals, which may require special instructions on some
+     architectures (notably x86). The port number of the peripheral being
+     accessed is passed as an argument.
 
-     These are similar to readX() and writeX(), but provide weaker memory
-     ordering guarantees.  Specifically, they do not guarantee ordering with
-     respect to normal memory accesses (e.g. DMA buffers) nor do they guarantee
-     ordering with respect to LOCK or UNLOCK operations.  If the latter is
-     required, an mmiowb() barrier can be used.  Note that relaxed accesses to
-     the same peripheral are guaranteed to be ordered with respect to each
-     other.
+     Since many CPU architectures ultimately access these peripherals via an
+     internal virtual memory mapping, the portable ordering guarantees provided
+     by inX() and outX() are the same as those provided by readX() and writeX()
+     respectively when accessing a mapping with the default I/O attributes.
 
  (*) ioreadX(), iowriteX()
 
      These will perform appropriately for the type of access they're actually
      doing, be it inX()/outX() or readX()/writeX().
 
+All of these accessors assume that the underlying peripheral is little-endian,
+and will therefore perform byte-swapping operations on big-endian architectures.
+
+Composing I/O ordering barriers with SMP ordering barriers and LOCK/UNLOCK
+operations is a dangerous sport which may require the use of mmiowb(). See the
+subsection "Acquires vs I/O accesses" for more information.
 
 ========================================
 ASSUMED MINIMUM EXECUTION ORDERING MODEL