All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] RFC: addition to DMA API
@ 2011-08-31 21:30 ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-arm-kernel, ming.lei, stern, Mark Salter

This patch set arose out of a discussion on linux-arm concerning a
performance problem with USB on some ARMv7 based platforms. The
problem was tracked down by ming.lei@canonical.com and found to be
the result of CPU writes to DMA-coherent memory being delayed in a
write buffer between the CPU and memory. One proposed patch fixed
only the immediate problem with the USB EHCI driver, but several
folks thought a more general approach was needed, so I put this series
of patches together as a starting point for wider discussion outside
the ARM specific list.

The original problem seen was that USB storage performance was unusually
poor on some ARMv7 based platforms. With my particular setup, I was
seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform
where the same disk driver would get ~21MB/s on a Cortex-A8 based system.
My understanding from subsequent discussion is that the A9 cores have
a write buffer between the CPU and memory which could buffer data for a
prolonged period even in the case of DMA coherent mappings. The ARM
architecture code largely mitigates this by doing a write buffer flush
as part of the MMIO functions used to write to device registers. This
avoids problems in almost all drivers because most need to write to a
device register to tell the device when something is written in the
shared DMA coherent memory. In the case of USB, an EHCI host controller
will poll certain DMA coherent memory locations for information coming
from the CPU. In that case, the write buffering negatively affects
performance.

This series of patches adds a new function to the DMA API to deal with
ARMv7 and any future architectures which have write buffering even for
DMA coherent memory. The proposed dma_coherent_write_sync() function
will allow those few drivers which need it to force out write buffer
data in a timely way to avoid performace issues.

Mark Salter (3):
  add dma_coherent_write_sync to DMA API
  define ARM-specific dma_coherent_write_sync
  add dma_coherent_write_sync calls to USB EHCI driver

 Documentation/DMA-API-HOWTO.txt    |   15 +++++++++++++++
 Documentation/DMA-API.txt          |   12 ++++++++++++
 arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
 drivers/usb/host/ehci-q.c          |    7 ++++++-
 include/linux/dma-mapping.h        |    6 ++++++
 5 files changed, 49 insertions(+), 1 deletions(-)

-- 
1.7.6


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-08-31 21:30 ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-arm-kernel

This patch set arose out of a discussion on linux-arm concerning a
performance problem with USB on some ARMv7 based platforms. The
problem was tracked down by ming.lei at canonical.com and found to be
the result of CPU writes to DMA-coherent memory being delayed in a
write buffer between the CPU and memory. One proposed patch fixed
only the immediate problem with the USB EHCI driver, but several
folks thought a more general approach was needed, so I put this series
of patches together as a starting point for wider discussion outside
the ARM specific list.

The original problem seen was that USB storage performance was unusually
poor on some ARMv7 based platforms. With my particular setup, I was
seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform
where the same disk driver would get ~21MB/s on a Cortex-A8 based system.
My understanding from subsequent discussion is that the A9 cores have
a write buffer between the CPU and memory which could buffer data for a
prolonged period even in the case of DMA coherent mappings. The ARM
architecture code largely mitigates this by doing a write buffer flush
as part of the MMIO functions used to write to device registers. This
avoids problems in almost all drivers because most need to write to a
device register to tell the device when something is written in the
shared DMA coherent memory. In the case of USB, an EHCI host controller
will poll certain DMA coherent memory locations for information coming
from the CPU. In that case, the write buffering negatively affects
performance.

This series of patches adds a new function to the DMA API to deal with
ARMv7 and any future architectures which have write buffering even for
DMA coherent memory. The proposed dma_coherent_write_sync() function
will allow those few drivers which need it to force out write buffer
data in a timely way to avoid performace issues.

Mark Salter (3):
  add dma_coherent_write_sync to DMA API
  define ARM-specific dma_coherent_write_sync
  add dma_coherent_write_sync calls to USB EHCI driver

 Documentation/DMA-API-HOWTO.txt    |   15 +++++++++++++++
 Documentation/DMA-API.txt          |   12 ++++++++++++
 arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
 drivers/usb/host/ehci-q.c          |    7 ++++++-
 include/linux/dma-mapping.h        |    6 ++++++
 5 files changed, 49 insertions(+), 1 deletions(-)

-- 
1.7.6

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
  2011-08-31 21:30 ` Mark Salter
@ 2011-08-31 21:30   ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-arm-kernel, ming.lei, stern, Mark Salter

On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
coherent memory may still be held in a write buffer for a significant amount
of time. This is largely mitigated by having the MMIO write functions force
a write buffer flush before doing the actual write to the MMIO register. This
forces out previous CPU writes to coherent memory for drivers which write to
a register to inform the device that something was written to memory. However,
this does not mitigate the problem for devices which poll the DMA memory for
changes written by the CPU. One such case was found by ming.lei@canonical.com
in the USB EHCI driver. The EHCI host controller relies at least partly on
polling DMA coherent memory for information from the driver.

This patch adds a dma_coherent_write_sync() function to the DMA API which
drivers can use to explicitly force out data which may otherwise be held up
in a write buffer. It is a no-op unless and architecture provides its own
version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 Documentation/DMA-API-HOWTO.txt |   15 +++++++++++++++
 Documentation/DMA-API.txt       |   12 ++++++++++++
 include/linux/dma-mapping.h     |    6 ++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt
index a0b6250..8c22b8b 100644
--- a/Documentation/DMA-API-HOWTO.txt
+++ b/Documentation/DMA-API-HOWTO.txt
@@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated
 from a pool before you destroy the pool. This function may not
 be called in interrupt context.
 
+Some architectures which supporting DMA coherent memory may still have write
+buffering between the CPU and DMA memory. This buffering may delay CPU writes
+from reaching coherent memory in a timely manner. These delays in turn can
+lead lead to dramatic performance issues in certain cases. An architecture
+may mitigate this problem to a large degree by having a write buffer flush
+implicit in the MMIO functions used to write to device registers. This works
+for the most common cases where a driver needs to write to a register to tell
+a device that something was written to the shared coherent memory. There are
+other cases where the device polls the dma-coherent memory for data written
+by the driver. In such cases, the driver needs to explicity force write buffer
+data to memory by calling:
+
+	dma_coherent_write_sync();
+
+
 			DMA Direction
 
 The interfaces described in subsequent portions of this document
diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index fe23269..44d6923 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
 	....
 
 
+Part Ie - Write buffering to dma-coherent memory
+------------------------------------------------
+
+Some architectures supporting DMA coherent memory may have write
+buffering between the CPU and DMA memory. This buffering may delay
+CPU writes from reaching coherent memory in a timely manner.
+
+    void
+    dma_coherent_write_sync()
+
+Force any outstanding coherent writes to memory.
+
 Part II - Advanced dma_ usage
 -----------------------------
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 347fdc3..b29d65c 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -154,6 +154,12 @@ dma_mark_declared_memory_occupied(struct device *dev,
 }
 #endif
 
+#ifndef ARCH_HAS_DMA_COHERENT_WRITE_SYNC
+static inline void dma_coherent_write_sync(void)
+{
+}
+#endif
+
 /*
  * Managed DMA API
  */
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
@ 2011-08-31 21:30   ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-arm-kernel

On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
coherent memory may still be held in a write buffer for a significant amount
of time. This is largely mitigated by having the MMIO write functions force
a write buffer flush before doing the actual write to the MMIO register. This
forces out previous CPU writes to coherent memory for drivers which write to
a register to inform the device that something was written to memory. However,
this does not mitigate the problem for devices which poll the DMA memory for
changes written by the CPU. One such case was found by ming.lei at canonical.com
in the USB EHCI driver. The EHCI host controller relies at least partly on
polling DMA coherent memory for information from the driver.

This patch adds a dma_coherent_write_sync() function to the DMA API which
drivers can use to explicitly force out data which may otherwise be held up
in a write buffer. It is a no-op unless and architecture provides its own
version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 Documentation/DMA-API-HOWTO.txt |   15 +++++++++++++++
 Documentation/DMA-API.txt       |   12 ++++++++++++
 include/linux/dma-mapping.h     |    6 ++++++
 3 files changed, 33 insertions(+), 0 deletions(-)

diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt
index a0b6250..8c22b8b 100644
--- a/Documentation/DMA-API-HOWTO.txt
+++ b/Documentation/DMA-API-HOWTO.txt
@@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated
 from a pool before you destroy the pool. This function may not
 be called in interrupt context.
 
+Some architectures which supporting DMA coherent memory may still have write
+buffering between the CPU and DMA memory. This buffering may delay CPU writes
+from reaching coherent memory in a timely manner. These delays in turn can
+lead lead to dramatic performance issues in certain cases. An architecture
+may mitigate this problem to a large degree by having a write buffer flush
+implicit in the MMIO functions used to write to device registers. This works
+for the most common cases where a driver needs to write to a register to tell
+a device that something was written to the shared coherent memory. There are
+other cases where the device polls the dma-coherent memory for data written
+by the driver. In such cases, the driver needs to explicity force write buffer
+data to memory by calling:
+
+	dma_coherent_write_sync();
+
+
 			DMA Direction
 
 The interfaces described in subsequent portions of this document
diff --git a/Documentation/DMA-API.txt b/Documentation/DMA-API.txt
index fe23269..44d6923 100644
--- a/Documentation/DMA-API.txt
+++ b/Documentation/DMA-API.txt
@@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
 	....
 
 
+Part Ie - Write buffering to dma-coherent memory
+------------------------------------------------
+
+Some architectures supporting DMA coherent memory may have write
+buffering between the CPU and DMA memory. This buffering may delay
+CPU writes from reaching coherent memory in a timely manner.
+
+    void
+    dma_coherent_write_sync()
+
+Force any outstanding coherent writes to memory.
+
 Part II - Advanced dma_ usage
 -----------------------------
 
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 347fdc3..b29d65c 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -154,6 +154,12 @@ dma_mark_declared_memory_occupied(struct device *dev,
 }
 #endif
 
+#ifndef ARCH_HAS_DMA_COHERENT_WRITE_SYNC
+static inline void dma_coherent_write_sync(void)
+{
+}
+#endif
+
 /*
  * Managed DMA API
  */
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-08-31 21:30 ` Mark Salter
@ 2011-08-31 21:30   ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-arm-kernel, ming.lei, stern, Mark Salter

For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM
specific dma_coherent_write_sync() to override the default version. This
routine forces out any data sitting in a write buffer between the CPU and
memory.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 7a21d0b..e99562b 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *,
 		void *, dma_addr_t, size_t);
 
 
+#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
+#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC
+
+static inline void dma_coherent_write_sync(void)
+{
+	dsb();
+	outer_sync();
+}
+#endif
+
 #ifdef CONFIG_DMABOUNCE
 /*
  * For SA-1111, IXP425, and ADI systems  the dma-mapping functions are "magic"
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-08-31 21:30   ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-arm-kernel

For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM
specific dma_coherent_write_sync() to override the default version. This
routine forces out any data sitting in a write buffer between the CPU and
memory.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
index 7a21d0b..e99562b 100644
--- a/arch/arm/include/asm/dma-mapping.h
+++ b/arch/arm/include/asm/dma-mapping.h
@@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *,
 		void *, dma_addr_t, size_t);
 
 
+#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
+#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC
+
+static inline void dma_coherent_write_sync(void)
+{
+	dsb();
+	outer_sync();
+}
+#endif
+
 #ifdef CONFIG_DMABOUNCE
 /*
  * For SA-1111, IXP425, and ADI systems  the dma-mapping functions are "magic"
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver
  2011-08-31 21:30 ` Mark Salter
@ 2011-08-31 21:30   ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-arm-kernel, ming.lei, stern, Mark Salter

The EHCI driver polls DMA coherent memory for control data written by the
driver. On some architectures, such as ARMv7, the writes from the driver
may get delayed in a write buffer even though it is written to DMA coherent
memory. This delay led to serious performance issues on an ARMv7 based
platform using a USB disk drive. Before using this patch, 'hdparm -t' showed
a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 drivers/usb/host/ehci-q.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..75d9838 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd)
 	/* HC must see latest qtd and qh data before we clear ACTIVE+HALT */
 	wmb ();
 	hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING);
+	dma_coherent_write_sync();
 }
 
 /* if it weren't for a common silicon quirk (writing the dummy into the qh
@@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh)
 					wmb();
 					hw->hw_token = cpu_to_hc32(ehci,
 							token);
+					dma_coherent_write_sync();
 					goto retry_xacterr;
 				}
 				stopped = 1;
@@ -753,8 +755,10 @@ qh_urb_transaction (
 	}
 
 	/* by default, enable interrupt on urb completion */
-	if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT)))
+	if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) {
 		qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC);
+		dma_coherent_write_sync();
+	}
 	return head;
 
 cleanup:
@@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds (
 			/* let the hc process these next qtds */
 			wmb ();
 			dummy->hw_token = token;
+			dma_coherent_write_sync();
 
 			urb->hcpriv = qh_get (qh);
 		}
-- 
1.7.6


^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver
@ 2011-08-31 21:30   ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-08-31 21:30 UTC (permalink / raw)
  To: linux-arm-kernel

The EHCI driver polls DMA coherent memory for control data written by the
driver. On some architectures, such as ARMv7, the writes from the driver
may get delayed in a write buffer even though it is written to DMA coherent
memory. This delay led to serious performance issues on an ARMv7 based
platform using a USB disk drive. Before using this patch, 'hdparm -t' showed
a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s.

Signed-off-by: Mark Salter <msalter@redhat.com>
---
 drivers/usb/host/ehci-q.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
index 0917e3a..75d9838 100644
--- a/drivers/usb/host/ehci-q.c
+++ b/drivers/usb/host/ehci-q.c
@@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd)
 	/* HC must see latest qtd and qh data before we clear ACTIVE+HALT */
 	wmb ();
 	hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING);
+	dma_coherent_write_sync();
 }
 
 /* if it weren't for a common silicon quirk (writing the dummy into the qh
@@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh)
 					wmb();
 					hw->hw_token = cpu_to_hc32(ehci,
 							token);
+					dma_coherent_write_sync();
 					goto retry_xacterr;
 				}
 				stopped = 1;
@@ -753,8 +755,10 @@ qh_urb_transaction (
 	}
 
 	/* by default, enable interrupt on urb completion */
-	if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT)))
+	if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) {
 		qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC);
+		dma_coherent_write_sync();
+	}
 	return head;
 
 cleanup:
@@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds (
 			/* let the hc process these next qtds */
 			wmb ();
 			dummy->hw_token = token;
+			dma_coherent_write_sync();
 
 			urb->hcpriv = qh_get (qh);
 		}
-- 
1.7.6

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-08-31 21:30 ` Mark Salter
@ 2011-09-01  2:09   ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  2:09 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, linux-arm-kernel, stern

Hi,

On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> This patch set arose out of a discussion on linux-arm concerning a
> performance problem with USB on some ARMv7 based platforms. The
> problem was tracked down by ming.lei@canonical.com and found to be
> the result of CPU writes to DMA-coherent memory being delayed in a
> write buffer between the CPU and memory. One proposed patch fixed
> only the immediate problem with the USB EHCI driver, but several
> folks thought a more general approach was needed, so I put this series
> of patches together as a starting point for wider discussion outside
> the ARM specific list.

After some further thoughts, I think it is not a good idea to introduce a
general DMA API to handle this case, see below:

1. The side-effect of new API is that it will make descriptors of dma in a
partial update, such as qtd in the ehci case, even ehci can handle this
successful, but it is really not good to make DMA bus master see a
partial update of descriptor, and I am not  sure that other kind of bus masters
can handle this correctly, which may introduce other problems. A proper memory
barrier will always make dma master see a atomic update of dma descriptors,
which should be the preferred way to take by device driver.

2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
The ehci case I reported is in the one of the most tricky code path in
ehci driver,
and it should be a special case, and up to now, we only have found this case
can't be handled by memory barriers. Is there other cases which can't be handled
correctly by mb/wmb/rmb? If so, please point it out.

3, The new DMA API for the purpose to be introduced is much easier to
understand, and much easier to use than memory barrier, so it is very
possible to make device driver guys misuse or abuse it instead of using
memory barrier first to handle the case.

>
> The original problem seen was that USB storage performance was unusually
> poor on some ARMv7 based platforms. With my particular setup, I was
> seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform
> where the same disk driver would get ~21MB/s on a Cortex-A8 based system.
> My understanding from subsequent discussion is that the A9 cores have
> a write buffer between the CPU and memory which could buffer data for a
> prolonged period even in the case of DMA coherent mappings. The ARM
> architecture code largely mitigates this by doing a write buffer flush
> as part of the MMIO functions used to write to device registers. This
> avoids problems in almost all drivers because most need to write to a
> device register to tell the device when something is written in the
> shared DMA coherent memory. In the case of USB, an EHCI host controller
> will poll certain DMA coherent memory locations for information coming
> from the CPU. In that case, the write buffering negatively affects
> performance.
>
> This series of patches adds a new function to the DMA API to deal with
> ARMv7 and any future architectures which have write buffering even for
> DMA coherent memory. The proposed dma_coherent_write_sync() function
> will allow those few drivers which need it to force out write buffer
> data in a timely way to avoid performace issues.
>
> Mark Salter (3):
>  add dma_coherent_write_sync to DMA API
>  define ARM-specific dma_coherent_write_sync
>  add dma_coherent_write_sync calls to USB EHCI driver
>
>  Documentation/DMA-API-HOWTO.txt    |   15 +++++++++++++++
>  Documentation/DMA-API.txt          |   12 ++++++++++++
>  arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
>  drivers/usb/host/ehci-q.c          |    7 ++++++-
>  include/linux/dma-mapping.h        |    6 ++++++
>  5 files changed, 49 insertions(+), 1 deletions(-)
>
> --
> 1.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  2:09   ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  2:09 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> This patch set arose out of a discussion on linux-arm concerning a
> performance problem with USB on some ARMv7 based platforms. The
> problem was tracked down by ming.lei at canonical.com and found to be
> the result of CPU writes to DMA-coherent memory being delayed in a
> write buffer between the CPU and memory. One proposed patch fixed
> only the immediate problem with the USB EHCI driver, but several
> folks thought a more general approach was needed, so I put this series
> of patches together as a starting point for wider discussion outside
> the ARM specific list.

After some further thoughts, I think it is not a good idea to introduce a
general DMA API to handle this case, see below:

1. The side-effect of new API is that it will make descriptors of dma in a
partial update, such as qtd in the ehci case, even ehci can handle this
successful, but it is really not good to make DMA bus master see a
partial update of descriptor, and I am not  sure that other kind of bus masters
can handle this correctly, which may introduce other problems. A proper memory
barrier will always make dma master see a atomic update of dma descriptors,
which should be the preferred way to take by device driver.

2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
The ehci case I reported is in the one of the most tricky code path in
ehci driver,
and it should be a special case, and up to now, we only have found this case
can't be handled by memory barriers. Is there other cases which can't be handled
correctly by mb/wmb/rmb? If so, please point it out.

3, The new DMA API for the purpose to be introduced is much easier to
understand, and much easier to use than memory barrier, so it is very
possible to make device driver guys misuse or abuse it instead of using
memory barrier first to handle the case.

>
> The original problem seen was that USB storage performance was unusually
> poor on some ARMv7 based platforms. With my particular setup, I was
> seeing hdparm -t report ~5.6MB/s on an SMP Cortex-A9 based platform
> where the same disk driver would get ~21MB/s on a Cortex-A8 based system.
> My understanding from subsequent discussion is that the A9 cores have
> a write buffer between the CPU and memory which could buffer data for a
> prolonged period even in the case of DMA coherent mappings. The ARM
> architecture code largely mitigates this by doing a write buffer flush
> as part of the MMIO functions used to write to device registers. This
> avoids problems in almost all drivers because most need to write to a
> device register to tell the device when something is written in the
> shared DMA coherent memory. In the case of USB, an EHCI host controller
> will poll certain DMA coherent memory locations for information coming
> from the CPU. In that case, the write buffering negatively affects
> performance.
>
> This series of patches adds a new function to the DMA API to deal with
> ARMv7 and any future architectures which have write buffering even for
> DMA coherent memory. The proposed dma_coherent_write_sync() function
> will allow those few drivers which need it to force out write buffer
> data in a timely way to avoid performace issues.
>
> Mark Salter (3):
> ?add dma_coherent_write_sync to DMA API
> ?define ARM-specific dma_coherent_write_sync
> ?add dma_coherent_write_sync calls to USB EHCI driver
>
> ?Documentation/DMA-API-HOWTO.txt ? ?| ? 15 +++++++++++++++
> ?Documentation/DMA-API.txt ? ? ? ? ?| ? 12 ++++++++++++
> ?arch/arm/include/asm/dma-mapping.h | ? 10 ++++++++++
> ?drivers/usb/host/ehci-q.c ? ? ? ? ?| ? ?7 ++++++-
> ?include/linux/dma-mapping.h ? ? ? ?| ? ?6 ++++++
> ?5 files changed, 49 insertions(+), 1 deletions(-)
>
> --
> 1.7.6
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver
  2011-08-31 21:30   ` Mark Salter
@ 2011-09-01  2:33     ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  2:33 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, linux-arm-kernel, stern

Hi,

On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> The EHCI driver polls DMA coherent memory for control data written by the
> driver. On some architectures, such as ARMv7, the writes from the driver
> may get delayed in a write buffer even though it is written to DMA coherent
> memory. This delay led to serious performance issues on an ARMv7 based
> platform using a USB disk drive. Before using this patch, 'hdparm -t' showed
> a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s.
>
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
>  drivers/usb/host/ehci-q.c |    7 ++++++-
>  1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..75d9838 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd)
>        /* HC must see latest qtd and qh data before we clear ACTIVE+HALT */
>        wmb ();
>        hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING);
> +       dma_coherent_write_sync();

It is not needed at all, just before the qh is linked into hw queue,
there is one
wmb to handle sync of qh correctly. Even the wmb can be removed as the patch
I have posted out in usb mail list.

>  }
>
>  /* if it weren't for a common silicon quirk (writing the dummy into the qh
> @@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh)
>                                        wmb();
>                                        hw->hw_token = cpu_to_hc32(ehci,
>                                                        token);
> +                                       dma_coherent_write_sync();

It is in a cold path, and if adding the helper or not does not matter.

>                                        goto retry_xacterr;
>                                }
>                                stopped = 1;
> @@ -753,8 +755,10 @@ qh_urb_transaction (
>        }
>
>        /* by default, enable interrupt on urb completion */
> -       if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT)))
> +       if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) {
>                qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC);
> +               dma_coherent_write_sync();

It is not needed at all, the wmb in qh_append_tds will handle sync of
qtd correctly.

> +       }
>        return head;
>
>  cleanup:
> @@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds (
>                        /* let the hc process these next qtds */
>                        wmb ();
>                        dummy->hw_token = token;
> +                       dma_coherent_write_sync();

It is the only one which does make sense up to now, see discussion in

      http://marc.info/?t=131472029700001&r=1&w=2
      http://marc.info/?t=131445642100002&r=1&w=2

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver
@ 2011-09-01  2:33     ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  2:33 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> The EHCI driver polls DMA coherent memory for control data written by the
> driver. On some architectures, such as ARMv7, the writes from the driver
> may get delayed in a write buffer even though it is written to DMA coherent
> memory. This delay led to serious performance issues on an ARMv7 based
> platform using a USB disk drive. Before using this patch, 'hdparm -t' showed
> a read speed of 5.7MB/s. After applying this patch, hdparm showed 23.5MB/s.
>
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
> ?drivers/usb/host/ehci-q.c | ? ?7 ++++++-
> ?1 files changed, 6 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/usb/host/ehci-q.c b/drivers/usb/host/ehci-q.c
> index 0917e3a..75d9838 100644
> --- a/drivers/usb/host/ehci-q.c
> +++ b/drivers/usb/host/ehci-q.c
> @@ -114,6 +114,7 @@ qh_update (struct ehci_hcd *ehci, struct ehci_qh *qh, struct ehci_qtd *qtd)
> ? ? ? ?/* HC must see latest qtd and qh data before we clear ACTIVE+HALT */
> ? ? ? ?wmb ();
> ? ? ? ?hw->hw_token &= cpu_to_hc32(ehci, QTD_TOGGLE | QTD_STS_PING);
> + ? ? ? dma_coherent_write_sync();

It is not needed at all, just before the qh is linked into hw queue,
there is one
wmb to handle sync of qh correctly. Even the wmb can be removed as the patch
I have posted out in usb mail list.

> ?}
>
> ?/* if it weren't for a common silicon quirk (writing the dummy into the qh
> @@ -404,6 +405,7 @@ qh_completions (struct ehci_hcd *ehci, struct ehci_qh *qh)
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?wmb();
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?hw->hw_token = cpu_to_hc32(ehci,
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?token);
> + ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? dma_coherent_write_sync();

It is in a cold path, and if adding the helper or not does not matter.

> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?goto retry_xacterr;
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?}
> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?stopped = 1;
> @@ -753,8 +755,10 @@ qh_urb_transaction (
> ? ? ? ?}
>
> ? ? ? ?/* by default, enable interrupt on urb completion */
> - ? ? ? if (likely (!(urb->transfer_flags & URB_NO_INTERRUPT)))
> + ? ? ? if (likely(!(urb->transfer_flags & URB_NO_INTERRUPT))) {
> ? ? ? ? ? ? ? ?qtd->hw_token |= cpu_to_hc32(ehci, QTD_IOC);
> + ? ? ? ? ? ? ? dma_coherent_write_sync();

It is not needed at all, the wmb in qh_append_tds will handle sync of
qtd correctly.

> + ? ? ? }
> ? ? ? ?return head;
>
> ?cleanup:
> @@ -1081,6 +1085,7 @@ static struct ehci_qh *qh_append_tds (
> ? ? ? ? ? ? ? ? ? ? ? ?/* let the hc process these next qtds */
> ? ? ? ? ? ? ? ? ? ? ? ?wmb ();
> ? ? ? ? ? ? ? ? ? ? ? ?dummy->hw_token = token;
> + ? ? ? ? ? ? ? ? ? ? ? dma_coherent_write_sync();

It is the only one which does make sense up to now, see discussion in

      http://marc.info/?t=131472029700001&r=1&w=2
      http://marc.info/?t=131445642100002&r=1&w=2

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] add dma_coherent_write_sync to DMA API
  2011-08-31 21:30   ` Mark Salter
@ 2011-09-01  2:59     ` Josh Cartwright
  -1 siblings, 0 replies; 64+ messages in thread
From: Josh Cartwright @ 2011-09-01  2:59 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, linux-arm-kernel, ming.lei, stern

On Wed, Aug 31, 2011 at 05:30:12PM -0400, Mark Salter wrote:
> On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
> coherent memory may still be held in a write buffer for a significant amount
> of time. This is largely mitigated by having the MMIO write functions force
> a write buffer flush before doing the actual write to the MMIO register. This
> forces out previous CPU writes to coherent memory for drivers which write to
> a register to inform the device that something was written to memory. However,
> this does not mitigate the problem for devices which poll the DMA memory for
> changes written by the CPU. One such case was found by ming.lei@canonical.com
> in the USB EHCI driver. The EHCI host controller relies at least partly on
> polling DMA coherent memory for information from the driver.
> 
> This patch adds a dma_coherent_write_sync() function to the DMA API which
> drivers can use to explicitly force out data which may otherwise be held up
> in a write buffer. It is a no-op unless and architecture provides its own
> version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.
> 
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
>  Documentation/DMA-API-HOWTO.txt |   15 +++++++++++++++
>  Documentation/DMA-API.txt       |   12 ++++++++++++
>  include/linux/dma-mapping.h     |    6 ++++++
>  3 files changed, 33 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt
> index a0b6250..8c22b8b 100644
> --- a/Documentation/DMA-API-HOWTO.txt
> +++ b/Documentation/DMA-API-HOWTO.txt
> @@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated
>  from a pool before you destroy the pool. This function may not
>  be called in interrupt context.
>  
> +Some architectures which supporting DMA coherent memory may still have write
> +buffering between the CPU and DMA memory. This buffering may delay CPU writes
> +from reaching coherent memory in a timely manner. These delays in turn can
> +lead lead to dramatic performance issues in certain cases. An architecture

'lead lead' -> 'lead'

-- 
                                       joshc

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
@ 2011-09-01  2:59     ` Josh Cartwright
  0 siblings, 0 replies; 64+ messages in thread
From: Josh Cartwright @ 2011-09-01  2:59 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Aug 31, 2011 at 05:30:12PM -0400, Mark Salter wrote:
> On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
> coherent memory may still be held in a write buffer for a significant amount
> of time. This is largely mitigated by having the MMIO write functions force
> a write buffer flush before doing the actual write to the MMIO register. This
> forces out previous CPU writes to coherent memory for drivers which write to
> a register to inform the device that something was written to memory. However,
> this does not mitigate the problem for devices which poll the DMA memory for
> changes written by the CPU. One such case was found by ming.lei at canonical.com
> in the USB EHCI driver. The EHCI host controller relies at least partly on
> polling DMA coherent memory for information from the driver.
> 
> This patch adds a dma_coherent_write_sync() function to the DMA API which
> drivers can use to explicitly force out data which may otherwise be held up
> in a write buffer. It is a no-op unless and architecture provides its own
> version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.
> 
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
>  Documentation/DMA-API-HOWTO.txt |   15 +++++++++++++++
>  Documentation/DMA-API.txt       |   12 ++++++++++++
>  include/linux/dma-mapping.h     |    6 ++++++
>  3 files changed, 33 insertions(+), 0 deletions(-)
> 
> diff --git a/Documentation/DMA-API-HOWTO.txt b/Documentation/DMA-API-HOWTO.txt
> index a0b6250..8c22b8b 100644
> --- a/Documentation/DMA-API-HOWTO.txt
> +++ b/Documentation/DMA-API-HOWTO.txt
> @@ -400,6 +400,21 @@ Make sure you've called dma_pool_free for all memory allocated
>  from a pool before you destroy the pool. This function may not
>  be called in interrupt context.
>  
> +Some architectures which supporting DMA coherent memory may still have write
> +buffering between the CPU and DMA memory. This buffering may delay CPU writes
> +from reaching coherent memory in a timely manner. These delays in turn can
> +lead lead to dramatic performance issues in certain cases. An architecture

'lead lead' -> 'lead'

-- 
                                       joshc

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  2:09   ` Ming Lei
@ 2011-09-01  3:09     ` Alan Stern
  -1 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01  3:09 UTC (permalink / raw)
  To: Ming Lei; +Cc: Mark Salter, linux-kernel, linux-arm-kernel

On Thu, 1 Sep 2011, Ming Lei wrote:

> Hi,
> 
> On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> > This patch set arose out of a discussion on linux-arm concerning a
> > performance problem with USB on some ARMv7 based platforms. The
> > problem was tracked down by ming.lei@canonical.com and found to be
> > the result of CPU writes to DMA-coherent memory being delayed in a
> > write buffer between the CPU and memory. One proposed patch fixed
> > only the immediate problem with the USB EHCI driver, but several
> > folks thought a more general approach was needed, so I put this series
> > of patches together as a starting point for wider discussion outside
> > the ARM specific list.
> 
> After some further thoughts, I think it is not a good idea to introduce a
> general DMA API to handle this case, see below:
> 
> 1. The side-effect of new API is that it will make descriptors of dma in a
> partial update, such as qtd in the ehci case, even ehci can handle this
> successful, but it is really not good to make DMA bus master see a
> partial update of descriptor, and I am not  sure that other kind of bus masters
> can handle this correctly, which may introduce other problems. A proper memory
> barrier will always make dma master see a atomic update of dma descriptors,
> which should be the preferred way to take by device driver.

No, this is completely wrong.

Firstly, you are forgetting about other architectures, ones in which
writes to coherent memory aren't buffered.  On those architectures
there's no way to prevent the DMA bus master from seeing an
intermediate state of the data structures.  Therefore the driver has to
be written so that even when this happens, everything will work
correctly.

Secondly, even when write flushes are used, you can't guarantee that
the DMA bus master will see an atomic update.  It might turn out that
the hardware occasionally flushes some writes very quickly, before the
data-structure updates are complete.

Thirdly, you are mixing up memory barriers with write flushes.  The
barriers are used to make sure that writes are done in the correct
order, whereas the flushes are used to make sure that writes are done
reasonably quickly.  One has nothing to do with the other, even if by
coincidence on ARM a memory barrier causes a write flush.  On other
architectures this might not be true.

> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.

No, they can't.  See the third point above.

> The ehci case I reported is in the one of the most tricky code path in
> ehci driver,
> and it should be a special case, and up to now, we only have found this case
> can't be handled by memory barriers. Is there other cases which can't be handled
> correctly by mb/wmb/rmb? If so, please point it out.
> 
> 3, The new DMA API for the purpose to be introduced is much easier to
> understand, and much easier to use than memory barrier, so it is very
> possible to make device driver guys misuse or abuse it instead of using
> memory barrier first to handle the case.

That criticism could apply to almost any new feature.  We shouldn't be 
afraid to adopt something new merely because it's so easy to use that 
it might be misused.

Alan Stern


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  3:09     ` Alan Stern
  0 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01  3:09 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 1 Sep 2011, Ming Lei wrote:

> Hi,
> 
> On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
> > This patch set arose out of a discussion on linux-arm concerning a
> > performance problem with USB on some ARMv7 based platforms. The
> > problem was tracked down by ming.lei at canonical.com and found to be
> > the result of CPU writes to DMA-coherent memory being delayed in a
> > write buffer between the CPU and memory. One proposed patch fixed
> > only the immediate problem with the USB EHCI driver, but several
> > folks thought a more general approach was needed, so I put this series
> > of patches together as a starting point for wider discussion outside
> > the ARM specific list.
> 
> After some further thoughts, I think it is not a good idea to introduce a
> general DMA API to handle this case, see below:
> 
> 1. The side-effect of new API is that it will make descriptors of dma in a
> partial update, such as qtd in the ehci case, even ehci can handle this
> successful, but it is really not good to make DMA bus master see a
> partial update of descriptor, and I am not  sure that other kind of bus masters
> can handle this correctly, which may introduce other problems. A proper memory
> barrier will always make dma master see a atomic update of dma descriptors,
> which should be the preferred way to take by device driver.

No, this is completely wrong.

Firstly, you are forgetting about other architectures, ones in which
writes to coherent memory aren't buffered.  On those architectures
there's no way to prevent the DMA bus master from seeing an
intermediate state of the data structures.  Therefore the driver has to
be written so that even when this happens, everything will work
correctly.

Secondly, even when write flushes are used, you can't guarantee that
the DMA bus master will see an atomic update.  It might turn out that
the hardware occasionally flushes some writes very quickly, before the
data-structure updates are complete.

Thirdly, you are mixing up memory barriers with write flushes.  The
barriers are used to make sure that writes are done in the correct
order, whereas the flushes are used to make sure that writes are done
reasonably quickly.  One has nothing to do with the other, even if by
coincidence on ARM a memory barrier causes a write flush.  On other
architectures this might not be true.

> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.

No, they can't.  See the third point above.

> The ehci case I reported is in the one of the most tricky code path in
> ehci driver,
> and it should be a special case, and up to now, we only have found this case
> can't be handled by memory barriers. Is there other cases which can't be handled
> correctly by mb/wmb/rmb? If so, please point it out.
> 
> 3, The new DMA API for the purpose to be introduced is much easier to
> understand, and much easier to use than memory barrier, so it is very
> possible to make device driver guys misuse or abuse it instead of using
> memory barrier first to handle the case.

That criticism could apply to almost any new feature.  We shouldn't be 
afraid to adopt something new merely because it's so easy to use that 
it might be misused.

Alan Stern

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  3:09     ` Alan Stern
@ 2011-09-01  3:41       ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  3:41 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

Hi,

On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> Hi,
>>
>> On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
>> > This patch set arose out of a discussion on linux-arm concerning a
>> > performance problem with USB on some ARMv7 based platforms. The
>> > problem was tracked down by ming.lei@canonical.com and found to be
>> > the result of CPU writes to DMA-coherent memory being delayed in a
>> > write buffer between the CPU and memory. One proposed patch fixed
>> > only the immediate problem with the USB EHCI driver, but several
>> > folks thought a more general approach was needed, so I put this series
>> > of patches together as a starting point for wider discussion outside
>> > the ARM specific list.
>>
>> After some further thoughts, I think it is not a good idea to introduce a
>> general DMA API to handle this case, see below:
>>
>> 1. The side-effect of new API is that it will make descriptors of dma in a
>> partial update, such as qtd in the ehci case, even ehci can handle this
>> successful, but it is really not good to make DMA bus master see a
>> partial update of descriptor, and I am not  sure that other kind of bus masters
>> can handle this correctly, which may introduce other problems. A proper memory
>> barrier will always make dma master see a atomic update of dma descriptors,
>> which should be the preferred way to take by device driver.
>
> No, this is completely wrong.
>
> Firstly, you are forgetting about other architectures, ones in which
> writes to coherent memory aren't buffered.  On those architectures
> there's no way to prevent the DMA bus master from seeing an
> intermediate state of the data structures.  Therefore the driver has to
> be written so that even when this happens, everything will work
> correctly.
>
> Secondly, even when write flushes are used, you can't guarantee that
> the DMA bus master will see an atomic update.  It might turn out that
> the hardware occasionally flushes some writes very quickly, before the
> data-structure updates are complete.
>
> Thirdly, you are mixing up memory barriers with write flushes.  The
> barriers are used to make sure that writes are done in the correct
> order, whereas the flushes are used to make sure that writes are done
> reasonably quickly.  One has nothing to do with the other, even if by
> coincidence on ARM a memory barrier causes a write flush.  On other
> architectures this might not be true.

I agree all about above, but what I described is from another view.
I post out the example before explaining my idea further:


	CPU			device	
	A=1;
	wmb
	B=2;
					read B
					read A

one wmb is used to order 'A=1' and 'B=2', which will make the two write
operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
Then the device can observe the two write events as the order above,
so if device has seen 'B==2', then device will surely see 'A==1'.

Suppose writing to A is operation to update dma descriptor, the above example
can make device always see a atomic update of descriptor, can't it?

My idea is that the memory access patterns are to be considered for
writer of device driver. For example, many memory access patterns on
EHCI hardware are described in detail.  Of course, device driver should
make full use of the background info, below is a example from ehci driver:

qh_link_async():

	/*prepare qh descriptor*/
	qh->qh_next = head->qh_next;
	qh->hw->hw_next = head->hw->hw_next;
	wmb ();

	/*link the qh descriptor into hardware queue*/
	head->qh_next.qh = qh;
	head->hw->hw_next = dma;

so once EHCI fetches a qh with the address of 'dma', it will always see
consistent content of qh descriptor, which could not be updated partially.

>
>> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
>
> No, they can't.  See the third point above.

The example above has demoed that barriers can do it, hasn't it?

>
>> The ehci case I reported is in the one of the most tricky code path in
>> ehci driver,
>> and it should be a special case, and up to now, we only have found this case
>> can't be handled by memory barriers. Is there other cases which can't be handled
>> correctly by mb/wmb/rmb? If so, please point it out.
>>
>> 3, The new DMA API for the purpose to be introduced is much easier to
>> understand, and much easier to use than memory barrier, so it is very
>> possible to make device driver guys misuse or abuse it instead of using
>> memory barrier first to handle the case.
>
> That criticism could apply to almost any new feature.  We shouldn't be
> afraid to adopt something new merely because it's so easy to use that
> it might be misused.

This point depends on the #1 and #2.

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  3:41       ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  3:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> Hi,
>>
>> On Thu, Sep 1, 2011 at 5:30 AM, Mark Salter <msalter@redhat.com> wrote:
>> > This patch set arose out of a discussion on linux-arm concerning a
>> > performance problem with USB on some ARMv7 based platforms. The
>> > problem was tracked down by ming.lei at canonical.com and found to be
>> > the result of CPU writes to DMA-coherent memory being delayed in a
>> > write buffer between the CPU and memory. One proposed patch fixed
>> > only the immediate problem with the USB EHCI driver, but several
>> > folks thought a more general approach was needed, so I put this series
>> > of patches together as a starting point for wider discussion outside
>> > the ARM specific list.
>>
>> After some further thoughts, I think it is not a good idea to introduce a
>> general DMA API to handle this case, see below:
>>
>> 1. The side-effect of new API is that it will make descriptors of dma in a
>> partial update, such as qtd in the ehci case, even ehci can handle this
>> successful, but it is really not good to make DMA bus master see a
>> partial update of descriptor, and I am not ?sure that other kind of bus masters
>> can handle this correctly, which may introduce other problems. A proper memory
>> barrier will always make dma master see a atomic update of dma descriptors,
>> which should be the preferred way to take by device driver.
>
> No, this is completely wrong.
>
> Firstly, you are forgetting about other architectures, ones in which
> writes to coherent memory aren't buffered. ?On those architectures
> there's no way to prevent the DMA bus master from seeing an
> intermediate state of the data structures. ?Therefore the driver has to
> be written so that even when this happens, everything will work
> correctly.
>
> Secondly, even when write flushes are used, you can't guarantee that
> the DMA bus master will see an atomic update. ?It might turn out that
> the hardware occasionally flushes some writes very quickly, before the
> data-structure updates are complete.
>
> Thirdly, you are mixing up memory barriers with write flushes. ?The
> barriers are used to make sure that writes are done in the correct
> order, whereas the flushes are used to make sure that writes are done
> reasonably quickly. ?One has nothing to do with the other, even if by
> coincidence on ARM a memory barrier causes a write flush. ?On other
> architectures this might not be true.

I agree all about above, but what I described is from another view.
I post out the example before explaining my idea further:


	CPU			device	
	A=1;
	wmb
	B=2;
					read B
					read A

one wmb is used to order 'A=1' and 'B=2', which will make the two write
operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
Then the device can observe the two write events as the order above,
so if device has seen 'B==2', then device will surely see 'A==1'.

Suppose writing to A is operation to update dma descriptor, the above example
can make device always see a atomic update of descriptor, can't it?

My idea is that the memory access patterns are to be considered for
writer of device driver. For example, many memory access patterns on
EHCI hardware are described in detail.  Of course, device driver should
make full use of the background info, below is a example from ehci driver:

qh_link_async():

	/*prepare qh descriptor*/
	qh->qh_next = head->qh_next;
	qh->hw->hw_next = head->hw->hw_next;
	wmb ();

	/*link the qh descriptor into hardware queue*/
	head->qh_next.qh = qh;
	head->hw->hw_next = dma;

so once EHCI fetches a qh with the address of 'dma', it will always see
consistent content of qh descriptor, which could not be updated partially.

>
>> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
>
> No, they can't. ?See the third point above.

The example above has demoed that barriers can do it, hasn't it?

>
>> The ehci case I reported is in the one of the most tricky code path in
>> ehci driver,
>> and it should be a special case, and up to now, we only have found this case
>> can't be handled by memory barriers. Is there other cases which can't be handled
>> correctly by mb/wmb/rmb? If so, please point it out.
>>
>> 3, The new DMA API for the purpose to be introduced is much easier to
>> understand, and much easier to use than memory barrier, so it is very
>> possible to make device driver guys misuse or abuse it instead of using
>> memory barrier first to handle the case.
>
> That criticism could apply to almost any new feature. ?We shouldn't be
> afraid to adopt something new merely because it's so easy to use that
> it might be misused.

This point depends on the #1 and #2.

thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  3:41       ` Ming Lei
@ 2011-09-01  8:45         ` Will Deacon
  -1 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01  8:45 UTC (permalink / raw)
  To: Ming Lei; +Cc: Alan Stern, linux-kernel, linux-arm-kernel, Mark Salter

On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote:
> Hi,
> 
> On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > No, this is completely wrong.
> >
> > Firstly, you are forgetting about other architectures, ones in which
> > writes to coherent memory aren't buffered.  On those architectures
> > there's no way to prevent the DMA bus master from seeing an
> > intermediate state of the data structures.  Therefore the driver has to
> > be written so that even when this happens, everything will work
> > correctly.
> >
> > Secondly, even when write flushes are used, you can't guarantee that
> > the DMA bus master will see an atomic update.  It might turn out that
> > the hardware occasionally flushes some writes very quickly, before the
> > data-structure updates are complete.
> >
> > Thirdly, you are mixing up memory barriers with write flushes.  The
> > barriers are used to make sure that writes are done in the correct
> > order, whereas the flushes are used to make sure that writes are done
> > reasonably quickly.  One has nothing to do with the other, even if by
> > coincidence on ARM a memory barrier causes a write flush.  On other
> > architectures this might not be true.
> 
> I agree all about above, but what I described is from another view.
> I post out the example before explaining my idea further:
> 
> 
> 	CPU			device	
> 	A=1;
> 	wmb
> 	B=2;
> 					read B
> 					read A
> 
> one wmb is used to order 'A=1' and 'B=2', which will make the two write
> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
> Then the device can observe the two write events as the order above,
> so if device has seen 'B==2', then device will surely see 'A==1'.
> 
> Suppose writing to A is operation to update dma descriptor, the above example
> can make device always see a atomic update of descriptor, can't it?
> 
> My idea is that the memory access patterns are to be considered for
> writer of device driver. For example, many memory access patterns on
> EHCI hardware are described in detail.  Of course, device driver should
> make full use of the background info, below is a example from ehci driver:
> 
> qh_link_async():
> 
> 	/*prepare qh descriptor*/
> 	qh->qh_next = head->qh_next;
> 	qh->hw->hw_next = head->hw->hw_next;
> 	wmb ();
> 
> 	/*link the qh descriptor into hardware queue*/
> 	head->qh_next.qh = qh;
> 	head->hw->hw_next = dma;
> 
> so once EHCI fetches a qh with the address of 'dma', it will always see
> consistent content of qh descriptor, which could not be updated partially.

I'm struggling to see what you're getting at here. The proposal has
*absolutely nothing* to do with memory barriers. All of the existing
barriers will remain - they are needed for correctness. What changes is the
addition of an /optional/ flush operation in order to guarantee some sort of
immediacy for writes to the coherent buffer.

> >> 3, The new DMA API for the purpose to be introduced is much easier to
> >> understand, and much easier to use than memory barrier, so it is very
> >> possible to make device driver guys misuse or abuse it instead of using
> >> memory barrier first to handle the case.
> >
> > That criticism could apply to almost any new feature.  We shouldn't be
> > afraid to adopt something new merely because it's so easy to use that
> > it might be misused.
> 
> This point depends on the #1 and #2.

Huh? I don't see the connection. If your worry is that people will start
littering their code with flush calls, I don't think that's especially
likely. The usual problem (from what I've seen) is that barriers tend to be
missing rather than overused so I don't see why this would be different for
a what has been proposed.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  8:45         ` Will Deacon
  0 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01  8:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote:
> Hi,
> 
> On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> >
> > No, this is completely wrong.
> >
> > Firstly, you are forgetting about other architectures, ones in which
> > writes to coherent memory aren't buffered. ?On those architectures
> > there's no way to prevent the DMA bus master from seeing an
> > intermediate state of the data structures. ?Therefore the driver has to
> > be written so that even when this happens, everything will work
> > correctly.
> >
> > Secondly, even when write flushes are used, you can't guarantee that
> > the DMA bus master will see an atomic update. ?It might turn out that
> > the hardware occasionally flushes some writes very quickly, before the
> > data-structure updates are complete.
> >
> > Thirdly, you are mixing up memory barriers with write flushes. ?The
> > barriers are used to make sure that writes are done in the correct
> > order, whereas the flushes are used to make sure that writes are done
> > reasonably quickly. ?One has nothing to do with the other, even if by
> > coincidence on ARM a memory barrier causes a write flush. ?On other
> > architectures this might not be true.
> 
> I agree all about above, but what I described is from another view.
> I post out the example before explaining my idea further:
> 
> 
> 	CPU			device	
> 	A=1;
> 	wmb
> 	B=2;
> 					read B
> 					read A
> 
> one wmb is used to order 'A=1' and 'B=2', which will make the two write
> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
> Then the device can observe the two write events as the order above,
> so if device has seen 'B==2', then device will surely see 'A==1'.
> 
> Suppose writing to A is operation to update dma descriptor, the above example
> can make device always see a atomic update of descriptor, can't it?
> 
> My idea is that the memory access patterns are to be considered for
> writer of device driver. For example, many memory access patterns on
> EHCI hardware are described in detail.  Of course, device driver should
> make full use of the background info, below is a example from ehci driver:
> 
> qh_link_async():
> 
> 	/*prepare qh descriptor*/
> 	qh->qh_next = head->qh_next;
> 	qh->hw->hw_next = head->hw->hw_next;
> 	wmb ();
> 
> 	/*link the qh descriptor into hardware queue*/
> 	head->qh_next.qh = qh;
> 	head->hw->hw_next = dma;
> 
> so once EHCI fetches a qh with the address of 'dma', it will always see
> consistent content of qh descriptor, which could not be updated partially.

I'm struggling to see what you're getting at here. The proposal has
*absolutely nothing* to do with memory barriers. All of the existing
barriers will remain - they are needed for correctness. What changes is the
addition of an /optional/ flush operation in order to guarantee some sort of
immediacy for writes to the coherent buffer.

> >> 3, The new DMA API for the purpose to be introduced is much easier to
> >> understand, and much easier to use than memory barrier, so it is very
> >> possible to make device driver guys misuse or abuse it instead of using
> >> memory barrier first to handle the case.
> >
> > That criticism could apply to almost any new feature. ?We shouldn't be
> > afraid to adopt something new merely because it's so easy to use that
> > it might be misused.
> 
> This point depends on the #1 and #2.

Huh? I don't see the connection. If your worry is that people will start
littering their code with flush calls, I don't think that's especially
likely. The usual problem (from what I've seen) is that barriers tend to be
missing rather than overused so I don't see why this would be different for
a what has been proposed.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-08-31 21:30 ` Mark Salter
@ 2011-09-01  9:11   ` Will Deacon
  -1 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01  9:11 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, ming.lei, stern, linux-arm-kernel

Hi Mark,

On Wed, Aug 31, 2011 at 10:30:11PM +0100, Mark Salter wrote:
> This patch set arose out of a discussion on linux-arm concerning a
> performance problem with USB on some ARMv7 based platforms. The
> problem was tracked down by ming.lei@canonical.com and found to be
> the result of CPU writes to DMA-coherent memory being delayed in a
> write buffer between the CPU and memory. One proposed patch fixed
> only the immediate problem with the USB EHCI driver, but several
> folks thought a more general approach was needed, so I put this series
> of patches together as a starting point for wider discussion outside
> the ARM specific list.

[...]

Whilst I'm in favour of this because it seems to be solving a real problem
(especially on OMAP4), I think we should get to the bottom of the issue
before augmenting the coherent DMA API. We currently have the following
unanswered questions:

(1) Why does a nosmp kernel not suffer from this problem?

(2) Why do *some* Tegra platforms seem to suffer whilst others do not?

(3) Can this problem be solved by configuring the hardware appropriately?

If (3) is true, then we might be better off solving it that way, although
I'd be interested to see if flushing CPU write buffers makes a difference to
I/O performance on other architectures using non-cacheable memory for
coherent DMA.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  9:11   ` Will Deacon
  0 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01  9:11 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Mark,

On Wed, Aug 31, 2011 at 10:30:11PM +0100, Mark Salter wrote:
> This patch set arose out of a discussion on linux-arm concerning a
> performance problem with USB on some ARMv7 based platforms. The
> problem was tracked down by ming.lei at canonical.com and found to be
> the result of CPU writes to DMA-coherent memory being delayed in a
> write buffer between the CPU and memory. One proposed patch fixed
> only the immediate problem with the USB EHCI driver, but several
> folks thought a more general approach was needed, so I put this series
> of patches together as a starting point for wider discussion outside
> the ARM specific list.

[...]

Whilst I'm in favour of this because it seems to be solving a real problem
(especially on OMAP4), I think we should get to the bottom of the issue
before augmenting the coherent DMA API. We currently have the following
unanswered questions:

(1) Why does a nosmp kernel not suffer from this problem?

(2) Why do *some* Tegra platforms seem to suffer whilst others do not?

(3) Can this problem be solved by configuring the hardware appropriately?

If (3) is true, then we might be better off solving it that way, although
I'd be interested to see if flushing CPU write buffers makes a difference to
I/O performance on other architectures using non-cacheable memory for
coherent DMA.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  8:45         ` Will Deacon
@ 2011-09-01  9:14           ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  9:14 UTC (permalink / raw)
  To: Will Deacon; +Cc: Alan Stern, linux-kernel, linux-arm-kernel, Mark Salter

On Thu, Sep 1, 2011 at 4:45 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote:
>> Hi,
>>
>> On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> >
>> > No, this is completely wrong.
>> >
>> > Firstly, you are forgetting about other architectures, ones in which
>> > writes to coherent memory aren't buffered.  On those architectures
>> > there's no way to prevent the DMA bus master from seeing an
>> > intermediate state of the data structures.  Therefore the driver has to
>> > be written so that even when this happens, everything will work
>> > correctly.
>> >
>> > Secondly, even when write flushes are used, you can't guarantee that
>> > the DMA bus master will see an atomic update.  It might turn out that
>> > the hardware occasionally flushes some writes very quickly, before the
>> > data-structure updates are complete.
>> >
>> > Thirdly, you are mixing up memory barriers with write flushes.  The
>> > barriers are used to make sure that writes are done in the correct
>> > order, whereas the flushes are used to make sure that writes are done
>> > reasonably quickly.  One has nothing to do with the other, even if by
>> > coincidence on ARM a memory barrier causes a write flush.  On other
>> > architectures this might not be true.
>>
>> I agree all about above, but what I described is from another view.
>> I post out the example before explaining my idea further:
>>
>>
>>       CPU                     device
>>       A=1;
>>       wmb
>>       B=2;
>>                                       read B
>>                                       read A
>>
>> one wmb is used to order 'A=1' and 'B=2', which will make the two write
>> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
>> Then the device can observe the two write events as the order above,
>> so if device has seen 'B==2', then device will surely see 'A==1'.
>>
>> Suppose writing to A is operation to update dma descriptor, the above example
>> can make device always see a atomic update of descriptor, can't it?
>>
>> My idea is that the memory access patterns are to be considered for
>> writer of device driver. For example, many memory access patterns on
>> EHCI hardware are described in detail.  Of course, device driver should
>> make full use of the background info, below is a example from ehci driver:
>>
>> qh_link_async():
>>
>>       /*prepare qh descriptor*/
>>       qh->qh_next = head->qh_next;
>>       qh->hw->hw_next = head->hw->hw_next;
>>       wmb ();
>>
>>       /*link the qh descriptor into hardware queue*/
>>       head->qh_next.qh = qh;
>>       head->hw->hw_next = dma;
>>
>> so once EHCI fetches a qh with the address of 'dma', it will always see
>> consistent content of qh descriptor, which could not be updated partially.
>
> I'm struggling to see what you're getting at here. The proposal has
> *absolutely nothing* to do with memory barriers. All of the existing
> barriers will remain - they are needed for correctness. What changes is the
> addition of an /optional/ flush operation in order to guarantee some sort of
> immediacy for writes to the coherent buffer.

First, most of barriers have made this kind of flush not necessary, as
explained int the example above. Even for ehci driver, the flush to be
added is only __one__ special case, so could you list other cases in
which the flush operation is a must and memory barrier can't do it?

If one can understand dma master access order on shared memory
in the context, it is not difficult to use memory barrier to avoid the
flush operation.

Second, as I said before, the flush operation is very easy to cause dma
descriptor updated in partial, as qtd descriptor updated in the ehci case.
It is one of the side effects of the flush to be introduced.  Fortunately, EHCI
can handle this correctly, but can other hardwares always handle this correctly?

>
>> >> 3, The new DMA API for the purpose to be introduced is much easier to
>> >> understand, and much easier to use than memory barrier, so it is very
>> >> possible to make device driver guys misuse or abuse it instead of using
>> >> memory barrier first to handle the case.
>> >
>> > That criticism could apply to almost any new feature.  We shouldn't be
>> > afraid to adopt something new merely because it's so easy to use that
>> > it might be misused.
>>
>> This point depends on the #1 and #2.
>
> Huh? I don't see the connection. If your worry is that people will start
> littering their code with flush calls, I don't think that's especially
> likely. The usual problem (from what I've seen) is that barriers tend to be
> missing rather than overused so I don't see why this would be different for
> a what has been proposed.

Barrier is often missed because it is very difficult to use and one have to
understand the dma master access order on shared memory by . The
flush to be introduced is very easy to understand, so it is very likely to be
abused, isn't it?


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01  9:14           ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01  9:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 1, 2011 at 4:45 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Thu, Sep 01, 2011 at 04:41:46AM +0100, Ming Lei wrote:
>> Hi,
>>
>> On Thu, Sep 1, 2011 at 11:09 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
>> >
>> > No, this is completely wrong.
>> >
>> > Firstly, you are forgetting about other architectures, ones in which
>> > writes to coherent memory aren't buffered. ?On those architectures
>> > there's no way to prevent the DMA bus master from seeing an
>> > intermediate state of the data structures. ?Therefore the driver has to
>> > be written so that even when this happens, everything will work
>> > correctly.
>> >
>> > Secondly, even when write flushes are used, you can't guarantee that
>> > the DMA bus master will see an atomic update. ?It might turn out that
>> > the hardware occasionally flushes some writes very quickly, before the
>> > data-structure updates are complete.
>> >
>> > Thirdly, you are mixing up memory barriers with write flushes. ?The
>> > barriers are used to make sure that writes are done in the correct
>> > order, whereas the flushes are used to make sure that writes are done
>> > reasonably quickly. ?One has nothing to do with the other, even if by
>> > coincidence on ARM a memory barrier causes a write flush. ?On other
>> > architectures this might not be true.
>>
>> I agree all about above, but what I described is from another view.
>> I post out the example before explaining my idea further:
>>
>>
>> ? ? ? CPU ? ? ? ? ? ? ? ? ? ? device
>> ? ? ? A=1;
>> ? ? ? wmb
>> ? ? ? B=2;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read B
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read A
>>
>> one wmb is used to order 'A=1' and 'B=2', which will make the two write
>> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
>> Then the device can observe the two write events as the order above,
>> so if device has seen 'B==2', then device will surely see 'A==1'.
>>
>> Suppose writing to A is operation to update dma descriptor, the above example
>> can make device always see a atomic update of descriptor, can't it?
>>
>> My idea is that the memory access patterns are to be considered for
>> writer of device driver. For example, many memory access patterns on
>> EHCI hardware are described in detail. ?Of course, device driver should
>> make full use of the background info, below is a example from ehci driver:
>>
>> qh_link_async():
>>
>> ? ? ? /*prepare qh descriptor*/
>> ? ? ? qh->qh_next = head->qh_next;
>> ? ? ? qh->hw->hw_next = head->hw->hw_next;
>> ? ? ? wmb ();
>>
>> ? ? ? /*link the qh descriptor into hardware queue*/
>> ? ? ? head->qh_next.qh = qh;
>> ? ? ? head->hw->hw_next = dma;
>>
>> so once EHCI fetches a qh with the address of 'dma', it will always see
>> consistent content of qh descriptor, which could not be updated partially.
>
> I'm struggling to see what you're getting at here. The proposal has
> *absolutely nothing* to do with memory barriers. All of the existing
> barriers will remain - they are needed for correctness. What changes is the
> addition of an /optional/ flush operation in order to guarantee some sort of
> immediacy for writes to the coherent buffer.

First, most of barriers have made this kind of flush not necessary, as
explained int the example above. Even for ehci driver, the flush to be
added is only __one__ special case, so could you list other cases in
which the flush operation is a must and memory barrier can't do it?

If one can understand dma master access order on shared memory
in the context, it is not difficult to use memory barrier to avoid the
flush operation.

Second, as I said before, the flush operation is very easy to cause dma
descriptor updated in partial, as qtd descriptor updated in the ehci case.
It is one of the side effects of the flush to be introduced.  Fortunately, EHCI
can handle this correctly, but can other hardwares always handle this correctly?

>
>> >> 3, The new DMA API for the purpose to be introduced is much easier to
>> >> understand, and much easier to use than memory barrier, so it is very
>> >> possible to make device driver guys misuse or abuse it instead of using
>> >> memory barrier first to handle the case.
>> >
>> > That criticism could apply to almost any new feature. ?We shouldn't be
>> > afraid to adopt something new merely because it's so easy to use that
>> > it might be misused.
>>
>> This point depends on the #1 and #2.
>
> Huh? I don't see the connection. If your worry is that people will start
> littering their code with flush calls, I don't think that's especially
> likely. The usual problem (from what I've seen) is that barriers tend to be
> missing rather than overused so I don't see why this would be different for
> a what has been proposed.

Barrier is often missed because it is very difficult to use and one have to
understand the dma master access order on shared memory by . The
flush to be introduced is very easy to understand, so it is very likely to be
abused, isn't it?


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] add dma_coherent_write_sync to DMA API
  2011-08-31 21:30   ` Mark Salter
@ 2011-09-01  9:57     ` Michał Mirosław
  -1 siblings, 0 replies; 64+ messages in thread
From: Michał Mirosław @ 2011-09-01  9:57 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, ming.lei, stern, linux-arm-kernel

2011/8/31 Mark Salter <msalter@redhat.com>:
> On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
> coherent memory may still be held in a write buffer for a significant amount
> of time. This is largely mitigated by having the MMIO write functions force
> a write buffer flush before doing the actual write to the MMIO register. This
> forces out previous CPU writes to coherent memory for drivers which write to
> a register to inform the device that something was written to memory. However,
> this does not mitigate the problem for devices which poll the DMA memory for
> changes written by the CPU. One such case was found by ming.lei@canonical.com
> in the USB EHCI driver. The EHCI host controller relies at least partly on
> polling DMA coherent memory for information from the driver.
>
> This patch adds a dma_coherent_write_sync() function to the DMA API which
> drivers can use to explicitly force out data which may otherwise be held up
> in a write buffer. It is a no-op unless and architecture provides its own
> version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.
[...]
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
>        ....
>
>
> +Part Ie - Write buffering to dma-coherent memory
> +------------------------------------------------
> +
> +Some architectures supporting DMA coherent memory may have write
> +buffering between the CPU and DMA memory. This buffering may delay
> +CPU writes from reaching coherent memory in a timely manner.
> +
> +    void
> +    dma_coherent_write_sync()
> +
> +Force any outstanding coherent writes to memory.

This should be merged or referenced in part Ia.

BTW, if there's no time limit on write buffers flushing, or if write
buffers can cause reordering of the writes, then the memory accesses
need to be managed just like non-DMA-coherent memory. So what differs
then in DMA-coherent vs non-DMA-coherent mappings then?

Best Regards,
Michał Mirosław

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
@ 2011-09-01  9:57     ` Michał Mirosław
  0 siblings, 0 replies; 64+ messages in thread
From: Michał Mirosław @ 2011-09-01  9:57 UTC (permalink / raw)
  To: linux-arm-kernel

2011/8/31 Mark Salter <msalter@redhat.com>:
> On ARMv6/7 DMA-coherent memory is bufferable which means that CPU writes to
> coherent memory may still be held in a write buffer for a significant amount
> of time. This is largely mitigated by having the MMIO write functions force
> a write buffer flush before doing the actual write to the MMIO register. This
> forces out previous CPU writes to coherent memory for drivers which write to
> a register to inform the device that something was written to memory. However,
> this does not mitigate the problem for devices which poll the DMA memory for
> changes written by the CPU. One such case was found by ming.lei at canonical.com
> in the USB EHCI driver. The EHCI host controller relies at least partly on
> polling DMA coherent memory for information from the driver.
>
> This patch adds a dma_coherent_write_sync() function to the DMA API which
> drivers can use to explicitly force out data which may otherwise be held up
> in a write buffer. It is a no-op unless and architecture provides its own
> version or the function and sets ARCH_HAS_DMA_COHERENT_WRITE_SYNC.
[...]
> --- a/Documentation/DMA-API.txt
> +++ b/Documentation/DMA-API.txt
> @@ -418,6 +418,18 @@ void whizco_dma_map_sg_attrs(struct device *dev, dma_addr_t dma_addr,
> ? ? ? ?....
>
>
> +Part Ie - Write buffering to dma-coherent memory
> +------------------------------------------------
> +
> +Some architectures supporting DMA coherent memory may have write
> +buffering between the CPU and DMA memory. This buffering may delay
> +CPU writes from reaching coherent memory in a timely manner.
> +
> + ? ?void
> + ? ?dma_coherent_write_sync()
> +
> +Force any outstanding coherent writes to memory.

This should be merged or referenced in part Ia.

BTW, if there's no time limit on write buffers flushing, or if write
buffers can cause reordering of the writes, then the memory accesses
need to be managed just like non-DMA-coherent memory. So what differs
then in DMA-coherent vs non-DMA-coherent mappings then?

Best Regards,
Micha? Miros?aw

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] add dma_coherent_write_sync to DMA API
  2011-09-01  9:57     ` Michał Mirosław
@ 2011-09-01 12:36       ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-01 12:36 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: linux-kernel, ming.lei, stern, linux-arm-kernel

On Thu, 2011-09-01 at 11:57 +0200, Michał Mirosław wrote:
> BTW, if there's no time limit on write buffers flushing, or if write
> buffers can cause reordering of the writes, then the memory accesses
> need to be managed just like non-DMA-coherent memory. So what differs
> then in DMA-coherent vs non-DMA-coherent mappings then?

My understanding is that ordering is preserved, but an ARM guy should
probably verify that.

IIUC, the write buffers could hold data indefinitely. As a practical
matter other writes needing to go out to memory will force buffered
data out eventually. Again, this is my understanding which may be
faulty. My feeling is that this extended write buffering makes it
hard to call the dma memory fully coherent, but other limitations on
ARMv7 make the buffering hard to avoid.

--Mark



^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
@ 2011-09-01 12:36       ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-01 12:36 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2011-09-01 at 11:57 +0200, Micha? Miros?aw wrote:
> BTW, if there's no time limit on write buffers flushing, or if write
> buffers can cause reordering of the writes, then the memory accesses
> need to be managed just like non-DMA-coherent memory. So what differs
> then in DMA-coherent vs non-DMA-coherent mappings then?

My understanding is that ordering is preserved, but an ARM guy should
probably verify that.

IIUC, the write buffers could hold data indefinitely. As a practical
matter other writes needing to go out to memory will force buffered
data out eventually. Again, this is my understanding which may be
faulty. My feeling is that this extended write buffering makes it
hard to call the dma memory fully coherent, but other limitations on
ARMv7 make the buffering hard to avoid.

--Mark

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  3:41       ` Ming Lei
@ 2011-09-01 15:22         ` Alan Stern
  -1 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 15:22 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=UTF-8, Size: 2570 bytes --]

On Thu, 1 Sep 2011, Ming Lei wrote:

> I agree all about above, but what I described is from another view.
> I post out the example before explaining my idea further:
> 
> 
> 	CPU			device	
> 	A=1;
> 	wmb
> 	B=2;
> 					read B
> 					read A
> 
> one wmb is used to order 'A=1' and 'B=2', which will make the two write
> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
> Then the device can observe the two write events as the order above,
> so if device has seen 'B==2', then device will surely see 'A==1'.
> 
> Suppose writing to A is operation to update dma descriptor, the above example
> can make device always see a atomic update of descriptor, can't it?

Suppose A and B are _both_ part of the dma descriptor.  The device 
might see A==1 and B==0, if the memory accesses occur like this:

	CPU		device
	---		------
	A = 1;
	wmb();
			read B
			read A
	B = 2;

When this happens, the device will observe a non-atomic update of the 
descriptor.  There's no way to prevent this.

> My idea is that the memory access patterns are to be considered for
> writer of device driver. For example, many memory access patterns on
> EHCI hardware are described in detail.  Of course, device driver should
> make full use of the background info, below is a example from ehci driver:
> 
> qh_link_async():
> 
> 	/*prepare qh descriptor*/
> 	qh->qh_next = head->qh_next;
> 	qh->hw->hw_next = head->hw->hw_next;
> 	wmb ();
> 
> 	/*link the qh descriptor into hardware queue*/
> 	head->qh_next.qh = qh;
> 	head->hw->hw_next = dma;
> 
> so once EHCI fetches a qh with the address of 'dma', it will always see
> consistent content of qh descriptor, which could not be updated partially.

Yes, of course.  That's what memory barriers are intended for, to make 
sure that writes occur in the correct order.  Without the wmb(), the 
CPU might decide to write out the value of head->hw->hw_next before 
writing out the value of qh->hw->hw_next.  Then the device might see an 
inconsistent set of values.

None of this has anything to do with the write flushes you want to add.

> >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
> >
> > No, they can't.  See the third point above.
> 
> The example above has demoed that barriers can do it, hasn't it?

The memory barrier in your qh_link_async() example can make sure that
the device always sees consistent data.  It doesn't guarantee that the
write to head->hw->hw_next will be flushed to memory in a reasonably 
short time, which is the problem you are trying to solve.

Alan Stern


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 15:22         ` Alan Stern
  0 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 15:22 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 1 Sep 2011, Ming Lei wrote:

> I agree all about above, but what I described is from another view.
> I post out the example before explaining my idea further:
> 
> 
> 	CPU			device	
> 	A=1;
> 	wmb
> 	B=2;
> 					read B
> 					read A
> 
> one wmb is used to order 'A=1' and 'B=2', which will make the two write
> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
> Then the device can observe the two write events as the order above,
> so if device has seen 'B==2', then device will surely see 'A==1'.
> 
> Suppose writing to A is operation to update dma descriptor, the above example
> can make device always see a atomic update of descriptor, can't it?

Suppose A and B are _both_ part of the dma descriptor.  The device 
might see A==1 and B==0, if the memory accesses occur like this:

	CPU		device
	---		------
	A = 1;
	wmb();
			read B
			read A
	B = 2;

When this happens, the device will observe a non-atomic update of the 
descriptor.  There's no way to prevent this.

> My idea is that the memory access patterns are to be considered for
> writer of device driver. For example, many memory access patterns on
> EHCI hardware are described in detail.  Of course, device driver should
> make full use of the background info, below is a example from ehci driver:
> 
> qh_link_async():
> 
> 	/*prepare qh descriptor*/
> 	qh->qh_next = head->qh_next;
> 	qh->hw->hw_next = head->hw->hw_next;
> 	wmb ();
> 
> 	/*link the qh descriptor into hardware queue*/
> 	head->qh_next.qh = qh;
> 	head->hw->hw_next = dma;
> 
> so once EHCI fetches a qh with the address of 'dma', it will always see
> consistent content of qh descriptor, which could not be updated partially.

Yes, of course.  That's what memory barriers are intended for, to make 
sure that writes occur in the correct order.  Without the wmb(), the 
CPU might decide to write out the value of head->hw->hw_next before 
writing out the value of qh->hw->hw_next.  Then the device might see an 
inconsistent set of values.

None of this has anything to do with the write flushes you want to add.

> >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
> >
> > No, they can't. ?See the third point above.
> 
> The example above has demoed that barriers can do it, hasn't it?

The memory barrier in your qh_link_async() example can make sure that
the device always sees consistent data.  It doesn't guarantee that the
write to head->hw->hw_next will be flushed to memory in a reasonably 
short time, which is the problem you are trying to solve.

Alan Stern

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01  9:14           ` Ming Lei
@ 2011-09-01 15:42             ` Alan Stern
  -1 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 15:42 UTC (permalink / raw)
  To: Ming Lei; +Cc: Will Deacon, linux-kernel, linux-arm-kernel, Mark Salter

On Thu, 1 Sep 2011, Ming Lei wrote:

> First, most of barriers have made this kind of flush not necessary, as
> explained int the example above.

You don't seem to understand the difference between memory barriers and 
write flushes.  Neither makes the other unnecessary -- they do 
different things.

Now, there is good reason to question why write flushes should be 
needed at all.  According to the definition in 
Documentation/DMA-API.txt (the document doesn't distinguish between 
coherent memory and consistent memory):

  Consistent memory is memory for which a write by either the device or
  the processor can immediately be read by the processor or device
  without having to worry about caching effects.  (You may however need
  to make sure to flush the processor's write buffers before telling
  devices to read that memory.)

As far as I can tell, we are talking about a cache flush rather than a 
processor write buffer flush.  If that's so, it would appear that the 
memory in question is not truly coherent.

>  Even for ehci driver, the flush to be
> added is only __one__ special case, so could you list other cases in
> which the flush operation is a must and memory barrier can't do it?

There are other places in ehci-hcd where write flushes would help
improve performance or correctness, although the one in qh_append_tds()  
is probably the most critical.  One such place is in qh_link_async();
others are in qh_link_periodic(), qh_unlink_periodic(), itd_link_urb(),
and sitd_link_urb().

And there are similar places in the other USB host controller drivers.

> If one can understand dma master access order on shared memory
> in the context, it is not difficult to use memory barrier to avoid the
> flush operation.

That simply is not true.  If it were, you would not have needed to 
submit your original patch.

> Second, as I said before, the flush operation is very easy to cause dma
> descriptor updated in partial, as qtd descriptor updated in the ehci case.

No, not if memory barriers are used correctly.

> It is one of the side effects of the flush to be introduced.  Fortunately, EHCI
> can handle this correctly, but can other hardwares always handle this correctly?

If they can't, they are already broken.  Adding write flushes won't 
make them any worse.

Alan Stern


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 15:42             ` Alan Stern
  0 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 15:42 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 1 Sep 2011, Ming Lei wrote:

> First, most of barriers have made this kind of flush not necessary, as
> explained int the example above.

You don't seem to understand the difference between memory barriers and 
write flushes.  Neither makes the other unnecessary -- they do 
different things.

Now, there is good reason to question why write flushes should be 
needed at all.  According to the definition in 
Documentation/DMA-API.txt (the document doesn't distinguish between 
coherent memory and consistent memory):

  Consistent memory is memory for which a write by either the device or
  the processor can immediately be read by the processor or device
  without having to worry about caching effects.  (You may however need
  to make sure to flush the processor's write buffers before telling
  devices to read that memory.)

As far as I can tell, we are talking about a cache flush rather than a 
processor write buffer flush.  If that's so, it would appear that the 
memory in question is not truly coherent.

>  Even for ehci driver, the flush to be
> added is only __one__ special case, so could you list other cases in
> which the flush operation is a must and memory barrier can't do it?

There are other places in ehci-hcd where write flushes would help
improve performance or correctness, although the one in qh_append_tds()  
is probably the most critical.  One such place is in qh_link_async();
others are in qh_link_periodic(), qh_unlink_periodic(), itd_link_urb(),
and sitd_link_urb().

And there are similar places in the other USB host controller drivers.

> If one can understand dma master access order on shared memory
> in the context, it is not difficult to use memory barrier to avoid the
> flush operation.

That simply is not true.  If it were, you would not have needed to 
submit your original patch.

> Second, as I said before, the flush operation is very easy to cause dma
> descriptor updated in partial, as qtd descriptor updated in the ehci case.

No, not if memory barriers are used correctly.

> It is one of the side effects of the flush to be introduced.  Fortunately, EHCI
> can handle this correctly, but can other hardwares always handle this correctly?

If they can't, they are already broken.  Adding write flushes won't 
make them any worse.

Alan Stern

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 15:22         ` Alan Stern
@ 2011-09-01 15:56           ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01 15:56 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

On Thu, Sep 1, 2011 at 11:22 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> I agree all about above, but what I described is from another view.
>> I post out the example before explaining my idea further:
>>
>>
>>       CPU                     device
>>       A=1;
>>       wmb
>>       B=2;
>>                                       read B
>>                                       read A
>>
>> one wmb is used to order 'A=1' and 'B=2', which will make the two write
>> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
>> Then the device can observe the two write events as the order above,
>> so if device has seen 'B==2', then device will surely see 'A==1'.
>>
>> Suppose writing to A is operation to update dma descriptor, the above example
>> can make device always see a atomic update of descriptor, can't it?
>
> Suppose A and B are _both_ part of the dma descriptor.  The device
> might see A==1 and B==0, if the memory accesses occur like this:
>
>        CPU             device
>        ---             ------
>        A = 1;
>        wmb();
>                        read B
>                        read A
>        B = 2;
>
> When this happens, the device will observe a non-atomic update of the
> descriptor.  There's no way to prevent this.

If device doesn't find that B is 2, it will not fetch descriptor of A,
and will observe
a atomic update, which is just EHCI does for many cases(such as 4.10.2).

>
>> My idea is that the memory access patterns are to be considered for
>> writer of device driver. For example, many memory access patterns on
>> EHCI hardware are described in detail.  Of course, device driver should
>> make full use of the background info, below is a example from ehci driver:
>>
>> qh_link_async():
>>
>>       /*prepare qh descriptor*/
>>       qh->qh_next = head->qh_next;
>>       qh->hw->hw_next = head->hw->hw_next;
>>       wmb ();
>>
>>       /*link the qh descriptor into hardware queue*/
>>       head->qh_next.qh = qh;
>>       head->hw->hw_next = dma;
>>
>> so once EHCI fetches a qh with the address of 'dma', it will always see
>> consistent content of qh descriptor, which could not be updated partially.
>
> Yes, of course.  That's what memory barriers are intended for, to make
> sure that writes occur in the correct order.  Without the wmb(), the
> CPU might decide to write out the value of head->hw->hw_next before
> writing out the value of qh->hw->hw_next.  Then the device might see an
> inconsistent set of values.
>
> None of this has anything to do with the write flushes you want to add.
>
>> >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
>> >
>> > No, they can't.  See the third point above.
>>
>> The example above has demoed that barriers can do it, hasn't it?
>
> The memory barrier in your qh_link_async() example can make sure that
> the device always sees consistent data.  It doesn't guarantee that the
> write to head->hw->hw_next will be flushed to memory in a reasonably
> short time, which is the problem you are trying to solve.

Yes, up to now, it is the only case in which the flush can address to,
and in which kind of cases device will poll DMA coherent memory contiguously,
I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
If there are many such kind of devices, the flush operation introduced will
make sense.

As far as I know, for most of devices, dma operation of bus master is triggered
by writing into mmio register instead of writing into coherent memory, and the
flush is not required in this case surely.


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 15:56           ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-01 15:56 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 1, 2011 at 11:22 PM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> I agree all about above, but what I described is from another view.
>> I post out the example before explaining my idea further:
>>
>>
>> ? ? ? CPU ? ? ? ? ? ? ? ? ? ? device
>> ? ? ? A=1;
>> ? ? ? wmb
>> ? ? ? B=2;
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read B
>> ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? read A
>>
>> one wmb is used to order 'A=1' and 'B=2', which will make the two write
>> operations reach to physical memory as the order: 'A=1' first, 'B=2' second.
>> Then the device can observe the two write events as the order above,
>> so if device has seen 'B==2', then device will surely see 'A==1'.
>>
>> Suppose writing to A is operation to update dma descriptor, the above example
>> can make device always see a atomic update of descriptor, can't it?
>
> Suppose A and B are _both_ part of the dma descriptor. ?The device
> might see A==1 and B==0, if the memory accesses occur like this:
>
> ? ? ? ?CPU ? ? ? ? ? ? device
> ? ? ? ?--- ? ? ? ? ? ? ------
> ? ? ? ?A = 1;
> ? ? ? ?wmb();
> ? ? ? ? ? ? ? ? ? ? ? ?read B
> ? ? ? ? ? ? ? ? ? ? ? ?read A
> ? ? ? ?B = 2;
>
> When this happens, the device will observe a non-atomic update of the
> descriptor. ?There's no way to prevent this.

If device doesn't find that B is 2, it will not fetch descriptor of A,
and will observe
a atomic update, which is just EHCI does for many cases(such as 4.10.2).

>
>> My idea is that the memory access patterns are to be considered for
>> writer of device driver. For example, many memory access patterns on
>> EHCI hardware are described in detail. ?Of course, device driver should
>> make full use of the background info, below is a example from ehci driver:
>>
>> qh_link_async():
>>
>> ? ? ? /*prepare qh descriptor*/
>> ? ? ? qh->qh_next = head->qh_next;
>> ? ? ? qh->hw->hw_next = head->hw->hw_next;
>> ? ? ? wmb ();
>>
>> ? ? ? /*link the qh descriptor into hardware queue*/
>> ? ? ? head->qh_next.qh = qh;
>> ? ? ? head->hw->hw_next = dma;
>>
>> so once EHCI fetches a qh with the address of 'dma', it will always see
>> consistent content of qh descriptor, which could not be updated partially.
>
> Yes, of course. ?That's what memory barriers are intended for, to make
> sure that writes occur in the correct order. ?Without the wmb(), the
> CPU might decide to write out the value of head->hw->hw_next before
> writing out the value of qh->hw->hw_next. ?Then the device might see an
> inconsistent set of values.
>
> None of this has anything to do with the write flushes you want to add.
>
>> >> 2, most of such cases can be handled correctly by mb/wmb/rmb barriers.
>> >
>> > No, they can't. ?See the third point above.
>>
>> The example above has demoed that barriers can do it, hasn't it?
>
> The memory barrier in your qh_link_async() example can make sure that
> the device always sees consistent data. ?It doesn't guarantee that the
> write to head->hw->hw_next will be flushed to memory in a reasonably
> short time, which is the problem you are trying to solve.

Yes, up to now, it is the only case in which the flush can address to,
and in which kind of cases device will poll DMA coherent memory contiguously,
I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
If there are many such kind of devices, the flush operation introduced will
make sense.

As far as I know, for most of devices, dma operation of bus master is triggered
by writing into mmio register instead of writing into coherent memory, and the
flush is not required in this case surely.


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 15:42             ` Alan Stern
@ 2011-09-01 16:04               ` Russell King - ARM Linux
  -1 siblings, 0 replies; 64+ messages in thread
From: Russell King - ARM Linux @ 2011-09-01 16:04 UTC (permalink / raw)
  To: Alan Stern
  Cc: Ming Lei, Will Deacon, linux-kernel, linux-arm-kernel, Mark Salter

On Thu, Sep 01, 2011 at 11:42:23AM -0400, Alan Stern wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
> 
> > First, most of barriers have made this kind of flush not necessary, as
> > explained int the example above.
> 
> You don't seem to understand the difference between memory barriers and 
> write flushes.  Neither makes the other unnecessary -- they do 
> different things.
> 
> Now, there is good reason to question why write flushes should be 
> needed at all.  According to the definition in 
> Documentation/DMA-API.txt (the document doesn't distinguish between 
> coherent memory and consistent memory):
> 
>   Consistent memory is memory for which a write by either the device or
>   the processor can immediately be read by the processor or device
>   without having to worry about caching effects.  (You may however need
>   to make sure to flush the processor's write buffers before telling
>   devices to read that memory.)
> 
> As far as I can tell, we are talking about a cache flush rather than a 
> processor write buffer flush.  If that's so, it would appear that the 
> memory in question is not truly coherent.

DMA coherent memory on ARM is implemented on ARMv5 and below by using
'noncacheable nonbufferable' memory.  There is no weak memory model to
worry about, and this memory type is seen as 'strongly ordered' - the
CPU stalls until the read or write has completed.  So no problem there.

On ARMv6 and above, the attributes change:

1. Memory type: [Normal, Device, Strongly ordered]
   All mappings of a physical address space are absolutely required to be
   of the same memory type, otherwise the result is unpredictable.  There
   is no mitigation against this.

2. For "normal memory", a variety of options are available to adjust the
   hints to the cache and memory subsystem - the options here are
   [Non-cacheable, write-back write alloc, write-through non-write alloc,
    write-back, non-write alloc.]

   Strictly to the ARM ARM, all mappings must, again, have the same
   attributes to avoid unpredictable behaviour.  There is a _temporary_
   architectural relaxation of this requirement provided certain conditions
   are met - which may become permanent.

We remap system memory (which has its standard direct-mapped kernel mapping
as 'normal memory, write-back' for DMA coherent memory into a separate
region marking it 'normal memory, non-cacheable'.  Strictly this violates
the architecture - but we have no other way at present to obtain DMA
coherent memory as we can't unmap the standard direct-mapped kernel mapping
(we'd have to touch _every_ page table in the system, and then issue TLB
flushes which may have to be smp_call_function'd, which you can't do from
IRQ context - one of the contexts which dma_alloc_coherent must work from.)

So far, no one has reported any ill effects - and there's been much pressure
from lots of people to ignore the architecture reference manual over this,
including from the CMA guys.

It _is_ possible that "unpredictable" means that we may hit cache lines in
the [VP]IPT cache via the non-cacheable mapping which have been created
by speculative loads via the cacheable mapping - and this is something
that has been worrying me for a long time.

I've tried several ways to fix this but the result has been regressions.
So far, I have no fix for this which will not cause a regression, which
will satisfy the ARM ARM, which will satisfy peoples expectations, etc.
There is a plan with CMA to try and do something about this, but that's
a long way off yet.

So, in summary what I'm saying is that _in theory_ our DMA coherent memory
on ARMv6+ should have nothing more than write buffering to contend with,
but that doesn't stop this being the first real concrete report proving
that what I've been going on about regarding the architectural requirements
over the last few years is actually very real and valid.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 16:04               ` Russell King - ARM Linux
  0 siblings, 0 replies; 64+ messages in thread
From: Russell King - ARM Linux @ 2011-09-01 16:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 01, 2011 at 11:42:23AM -0400, Alan Stern wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
> 
> > First, most of barriers have made this kind of flush not necessary, as
> > explained int the example above.
> 
> You don't seem to understand the difference between memory barriers and 
> write flushes.  Neither makes the other unnecessary -- they do 
> different things.
> 
> Now, there is good reason to question why write flushes should be 
> needed at all.  According to the definition in 
> Documentation/DMA-API.txt (the document doesn't distinguish between 
> coherent memory and consistent memory):
> 
>   Consistent memory is memory for which a write by either the device or
>   the processor can immediately be read by the processor or device
>   without having to worry about caching effects.  (You may however need
>   to make sure to flush the processor's write buffers before telling
>   devices to read that memory.)
> 
> As far as I can tell, we are talking about a cache flush rather than a 
> processor write buffer flush.  If that's so, it would appear that the 
> memory in question is not truly coherent.

DMA coherent memory on ARM is implemented on ARMv5 and below by using
'noncacheable nonbufferable' memory.  There is no weak memory model to
worry about, and this memory type is seen as 'strongly ordered' - the
CPU stalls until the read or write has completed.  So no problem there.

On ARMv6 and above, the attributes change:

1. Memory type: [Normal, Device, Strongly ordered]
   All mappings of a physical address space are absolutely required to be
   of the same memory type, otherwise the result is unpredictable.  There
   is no mitigation against this.

2. For "normal memory", a variety of options are available to adjust the
   hints to the cache and memory subsystem - the options here are
   [Non-cacheable, write-back write alloc, write-through non-write alloc,
    write-back, non-write alloc.]

   Strictly to the ARM ARM, all mappings must, again, have the same
   attributes to avoid unpredictable behaviour.  There is a _temporary_
   architectural relaxation of this requirement provided certain conditions
   are met - which may become permanent.

We remap system memory (which has its standard direct-mapped kernel mapping
as 'normal memory, write-back' for DMA coherent memory into a separate
region marking it 'normal memory, non-cacheable'.  Strictly this violates
the architecture - but we have no other way at present to obtain DMA
coherent memory as we can't unmap the standard direct-mapped kernel mapping
(we'd have to touch _every_ page table in the system, and then issue TLB
flushes which may have to be smp_call_function'd, which you can't do from
IRQ context - one of the contexts which dma_alloc_coherent must work from.)

So far, no one has reported any ill effects - and there's been much pressure
from lots of people to ignore the architecture reference manual over this,
including from the CMA guys.

It _is_ possible that "unpredictable" means that we may hit cache lines in
the [VP]IPT cache via the non-cacheable mapping which have been created
by speculative loads via the cacheable mapping - and this is something
that has been worrying me for a long time.

I've tried several ways to fix this but the result has been regressions.
So far, I have no fix for this which will not cause a regression, which
will satisfy the ARM ARM, which will satisfy peoples expectations, etc.
There is a plan with CMA to try and do something about this, but that's
a long way off yet.

So, in summary what I'm saying is that _in theory_ our DMA coherent memory
on ARMv6+ should have nothing more than write buffering to contend with,
but that doesn't stop this being the first real concrete report proving
that what I've been going on about regarding the architectural requirements
over the last few years is actually very real and valid.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 15:56           ` Ming Lei
@ 2011-09-01 16:48             ` Alan Stern
  -1 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 16:48 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #1: Type: TEXT/PLAIN; charset=UTF-8, Size: 1767 bytes --]

On Thu, 1 Sep 2011, Ming Lei wrote:

> > Suppose A and B are _both_ part of the dma descriptor.  The device
> > might see A==1 and B==0, if the memory accesses occur like this:
> >
> >        CPU             device
> >        ---             ------
> >        A = 1;
> >        wmb();
> >                        read B
> >                        read A
> >        B = 2;
> >
> > When this happens, the device will observe a non-atomic update of the
> > descriptor.  There's no way to prevent this.
> 
> If device doesn't find that B is 2, it will not fetch descriptor of A,
> and will observe
> a atomic update, which is just EHCI does for many cases(such as 4.10.2).

You didn't read what I wrote above.  Suppose A and B are _both_ part of 
the same descriptor, like hw_token and hw_qtd_next.


> > The memory barrier in your qh_link_async() example can make sure that
> > the device always sees consistent data.  It doesn't guarantee that the
> > write to head->hw->hw_next will be flushed to memory in a reasonably
> > short time, which is the problem you are trying to solve.
> 
> Yes, up to now, it is the only case in which the flush can address to,
> and in which kind of cases device will poll DMA coherent memory contiguously,
> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).

Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.

> If there are many such kind of devices, the flush operation introduced will
> make sense.
> 
> As far as I know, for most of devices, dma operation of bus master is triggered
> by writing into mmio register instead of writing into coherent memory, and the
> flush is not required in this case surely.

That's right.  It is needed only when the device polls automatically.

Alan Stern


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 16:48             ` Alan Stern
  0 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-01 16:48 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 1 Sep 2011, Ming Lei wrote:

> > Suppose A and B are _both_ part of the dma descriptor. ?The device
> > might see A==1 and B==0, if the memory accesses occur like this:
> >
> > ? ? ? ?CPU ? ? ? ? ? ? device
> > ? ? ? ?--- ? ? ? ? ? ? ------
> > ? ? ? ?A = 1;
> > ? ? ? ?wmb();
> > ? ? ? ? ? ? ? ? ? ? ? ?read B
> > ? ? ? ? ? ? ? ? ? ? ? ?read A
> > ? ? ? ?B = 2;
> >
> > When this happens, the device will observe a non-atomic update of the
> > descriptor. ?There's no way to prevent this.
> 
> If device doesn't find that B is 2, it will not fetch descriptor of A,
> and will observe
> a atomic update, which is just EHCI does for many cases(such as 4.10.2).

You didn't read what I wrote above.  Suppose A and B are _both_ part of 
the same descriptor, like hw_token and hw_qtd_next.


> > The memory barrier in your qh_link_async() example can make sure that
> > the device always sees consistent data. ?It doesn't guarantee that the
> > write to head->hw->hw_next will be flushed to memory in a reasonably
> > short time, which is the problem you are trying to solve.
> 
> Yes, up to now, it is the only case in which the flush can address to,
> and in which kind of cases device will poll DMA coherent memory contiguously,
> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).

Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.

> If there are many such kind of devices, the flush operation introduced will
> make sense.
> 
> As far as I know, for most of devices, dma operation of bus master is triggered
> by writing into mmio register instead of writing into coherent memory, and the
> flush is not required in this case surely.

That's right.  It is needed only when the device polls automatically.

Alan Stern

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 16:04               ` Russell King - ARM Linux
@ 2011-09-01 17:31                 ` Will Deacon
  -1 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01 17:31 UTC (permalink / raw)
  To: Russell King - ARM Linux
  Cc: Alan Stern, Ming Lei, linux-kernel, linux-arm-kernel, Mark Salter

Hi Russell,

On Thu, Sep 01, 2011 at 05:04:29PM +0100, Russell King - ARM Linux wrote:
> DMA coherent memory on ARM is implemented on ARMv5 and below by using
> 'noncacheable nonbufferable' memory.  There is no weak memory model to
> worry about, and this memory type is seen as 'strongly ordered' - the
> CPU stalls until the read or write has completed.  So no problem there.
> 
> On ARMv6 and above, the attributes change:
> 
> 1. Memory type: [Normal, Device, Strongly ordered]
>    All mappings of a physical address space are absolutely required to be
>    of the same memory type, otherwise the result is unpredictable.  There
>    is no mitigation against this.
> 
> 2. For "normal memory", a variety of options are available to adjust the
>    hints to the cache and memory subsystem - the options here are
>    [Non-cacheable, write-back write alloc, write-through non-write alloc,
>     write-back, non-write alloc.]
> 
>    Strictly to the ARM ARM, all mappings must, again, have the same
>    attributes to avoid unpredictable behaviour.  There is a _temporary_
>    architectural relaxation of this requirement provided certain conditions
>    are met - which may become permanent.

This looks set to appear in revision C of the ARM ARM.

> It _is_ possible that "unpredictable" means that we may hit cache lines in
> the [VP]IPT cache via the non-cacheable mapping which have been created
> by speculative loads via the cacheable mapping - and this is something
> that has been worrying me for a long time.

Whilst this can happen, this will only cause problems for reads performed
by the CPU (as these may hit a line speculatively loaded via the cacheable
alias). Setting bit 22 in the auxillary control register gets arounds this:

http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1

Given that I believe our coherent DMA memory is `cacheable, bufferable, do
not allocate' in terms of AXI attributes, then writes will go straight to
the write buffer on the PL310.

> So, in summary what I'm saying is that _in theory_ our DMA coherent memory
> on ARMv6+ should have nothing more than write buffering to contend with,
> but that doesn't stop this being the first real concrete report proving
> that what I've been going on about regarding the architectural requirements
> over the last few years is actually very real and valid.

I don't think what we're seeing in this case is caused by mismatched memory
attributes, especially as passing `nosmp' on the command-line makes the
performance issue disappear.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 17:31                 ` Will Deacon
  0 siblings, 0 replies; 64+ messages in thread
From: Will Deacon @ 2011-09-01 17:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Russell,

On Thu, Sep 01, 2011 at 05:04:29PM +0100, Russell King - ARM Linux wrote:
> DMA coherent memory on ARM is implemented on ARMv5 and below by using
> 'noncacheable nonbufferable' memory.  There is no weak memory model to
> worry about, and this memory type is seen as 'strongly ordered' - the
> CPU stalls until the read or write has completed.  So no problem there.
> 
> On ARMv6 and above, the attributes change:
> 
> 1. Memory type: [Normal, Device, Strongly ordered]
>    All mappings of a physical address space are absolutely required to be
>    of the same memory type, otherwise the result is unpredictable.  There
>    is no mitigation against this.
> 
> 2. For "normal memory", a variety of options are available to adjust the
>    hints to the cache and memory subsystem - the options here are
>    [Non-cacheable, write-back write alloc, write-through non-write alloc,
>     write-back, non-write alloc.]
> 
>    Strictly to the ARM ARM, all mappings must, again, have the same
>    attributes to avoid unpredictable behaviour.  There is a _temporary_
>    architectural relaxation of this requirement provided certain conditions
>    are met - which may become permanent.

This looks set to appear in revision C of the ARM ARM.

> It _is_ possible that "unpredictable" means that we may hit cache lines in
> the [VP]IPT cache via the non-cacheable mapping which have been created
> by speculative loads via the cacheable mapping - and this is something
> that has been worrying me for a long time.

Whilst this can happen, this will only cause problems for reads performed
by the CPU (as these may hit a line speculatively loaded via the cacheable
alias). Setting bit 22 in the auxillary control register gets arounds this:

http://www.arm.linux.org.uk/developer/patches/viewpatch.php?id=6529/1

Given that I believe our coherent DMA memory is `cacheable, bufferable, do
not allocate' in terms of AXI attributes, then writes will go straight to
the write buffer on the PL310.

> So, in summary what I'm saying is that _in theory_ our DMA coherent memory
> on ARMv6+ should have nothing more than write buffering to contend with,
> but that doesn't stop this being the first real concrete report proving
> that what I've been going on about regarding the architectural requirements
> over the last few years is actually very real and valid.

I don't think what we're seeing in this case is caused by mismatched memory
attributes, especially as passing `nosmp' on the command-line makes the
performance issue disappear.

Will

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 17:31                 ` Will Deacon
@ 2011-09-01 18:07                   ` Russell King - ARM Linux
  -1 siblings, 0 replies; 64+ messages in thread
From: Russell King - ARM Linux @ 2011-09-01 18:07 UTC (permalink / raw)
  To: Will Deacon
  Cc: Alan Stern, Ming Lei, linux-kernel, linux-arm-kernel, Mark Salter

On Thu, Sep 01, 2011 at 06:31:49PM +0100, Will Deacon wrote:
> Given that I believe our coherent DMA memory is `cacheable, bufferable, do
> not allocate' in terms of AXI attributes, then writes will go straight to
> the write buffer on the PL310.

It's TEXCB=001, which due to the PRRR and NMRR registers is memory type
10, inner policy 00 outer policy 00, which translated means (as I said)
"Normal memory", "non-cacheable" at both the inner and outer levels.

How that gets translated to AXI attributes is outside the scope of the
ARM ARM, it's probably in some mega secret AXI document somewhere which
I don't have access to.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 18:07                   ` Russell King - ARM Linux
  0 siblings, 0 replies; 64+ messages in thread
From: Russell King - ARM Linux @ 2011-09-01 18:07 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 01, 2011 at 06:31:49PM +0100, Will Deacon wrote:
> Given that I believe our coherent DMA memory is `cacheable, bufferable, do
> not allocate' in terms of AXI attributes, then writes will go straight to
> the write buffer on the PL310.

It's TEXCB=001, which due to the PRRR and NMRR registers is memory type
10, inner policy 00 outer policy 00, which translated means (as I said)
"Normal memory", "non-cacheable" at both the inner and outer levels.

How that gets translated to AXI attributes is outside the scope of the
ARM ARM, it's probably in some mega secret AXI document somewhere which
I don't have access to.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 17:31                 ` Will Deacon
@ 2011-09-01 19:14                   ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-01 19:14 UTC (permalink / raw)
  To: Will Deacon
  Cc: Russell King - ARM Linux, Alan Stern, Ming Lei, linux-kernel,
	linux-arm-kernel

On Thu, 2011-09-01 at 18:31 +0100, Will Deacon wrote:
> I don't think what we're seeing in this case is caused by mismatched memory
> attributes, especially as passing `nosmp' on the command-line makes the
> performance issue disappear.

I'm coming to think we are dealing with two different problems.

We have the original problem where adding the write buffer flush
to EHCI gives a 4x performance boost to USB. Also adding nosmp to
the cmdline gives pretty much the same boost. This is looking like
something other than just data getting held up in a write buffer.

On the other hand, on a nosmp kernel, I get about 3-4% performance
boost for hdparm -t using the write buffer flush patch vs. without
it.

So, regardless of what turns out to be the actual cause of the 4x
problem, it may still be worthwhile to have the explicit write
buffer sync API if we can't avoid using buffered mappings for DMA.

--Mark



^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-01 19:14                   ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-01 19:14 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, 2011-09-01 at 18:31 +0100, Will Deacon wrote:
> I don't think what we're seeing in this case is caused by mismatched memory
> attributes, especially as passing `nosmp' on the command-line makes the
> performance issue disappear.

I'm coming to think we are dealing with two different problems.

We have the original problem where adding the write buffer flush
to EHCI gives a 4x performance boost to USB. Also adding nosmp to
the cmdline gives pretty much the same boost. This is looking like
something other than just data getting held up in a write buffer.

On the other hand, on a nosmp kernel, I get about 3-4% performance
boost for hdparm -t using the write buffer flush patch vs. without
it.

So, regardless of what turns out to be the actual cause of the 4x
problem, it may still be worthwhile to have the explicit write
buffer sync API if we can't avoid using buffered mappings for DMA.

--Mark

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-01 16:48             ` Alan Stern
@ 2011-09-02  0:59               ` Ming Lei
  -1 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-02  0:59 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

Hi,

On Fri, Sep 2, 2011 at 12:48 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> > Suppose A and B are _both_ part of the dma descriptor.  The device
>> > might see A==1 and B==0, if the memory accesses occur like this:
>> >
>> >        CPU             device
>> >        ---             ------
>> >        A = 1;
>> >        wmb();
>> >                        read B
>> >                        read A
>> >        B = 2;
>> >
>> > When this happens, the device will observe a non-atomic update of the
>> > descriptor.  There's no way to prevent this.
>>
>> If device doesn't find that B is 2, it will not fetch descriptor of A,
>> and will observe
>> a atomic update, which is just EHCI does for many cases(such as 4.10.2).
>
> You didn't read what I wrote above.  Suppose A and B are _both_ part of
> the same descriptor, like hw_token and hw_qtd_next.

I think the case(keep writing order between parts in a same dma descriptor)
is only in constant dma-poll master case, just like ehci/uhci.

General case is that memory barrier is required before linking one dma
descriptor into hardware queue but after the dma descriptor is prepared.

>
>> > The memory barrier in your qh_link_async() example can make sure that
>> > the device always sees consistent data.  It doesn't guarantee that the
>> > write to head->hw->hw_next will be flushed to memory in a reasonably
>> > short time, which is the problem you are trying to solve.
>>
>> Yes, up to now, it is the only case in which the flush can address to,
>> and in which kind of cases device will poll DMA coherent memory contiguously,
>> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
>
> Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.

In fact, the flush may be not required for ohci and xhci case, since there is
already one mmio register writing at the end of .enqueue path in ohci/xhci
driver.(just a glance at the code of ohci/xhci, please correct if I am wrong)

For UHCI, looks like it has not been used on ARM, so maybe can ignore it.
UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush
can produce a obvious performance boost.

So looks like the flush only makes sense on EHCI.

If the above is not wrong, is it really needed to introduce a general DMA API
only for EHCI?


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-02  0:59               ` Ming Lei
  0 siblings, 0 replies; 64+ messages in thread
From: Ming Lei @ 2011-09-02  0:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Sep 2, 2011 at 12:48 AM, Alan Stern <stern@rowland.harvard.edu> wrote:
> On Thu, 1 Sep 2011, Ming Lei wrote:
>
>> > Suppose A and B are _both_ part of the dma descriptor. ?The device
>> > might see A==1 and B==0, if the memory accesses occur like this:
>> >
>> > ? ? ? ?CPU ? ? ? ? ? ? device
>> > ? ? ? ?--- ? ? ? ? ? ? ------
>> > ? ? ? ?A = 1;
>> > ? ? ? ?wmb();
>> > ? ? ? ? ? ? ? ? ? ? ? ?read B
>> > ? ? ? ? ? ? ? ? ? ? ? ?read A
>> > ? ? ? ?B = 2;
>> >
>> > When this happens, the device will observe a non-atomic update of the
>> > descriptor. ?There's no way to prevent this.
>>
>> If device doesn't find that B is 2, it will not fetch descriptor of A,
>> and will observe
>> a atomic update, which is just EHCI does for many cases(such as 4.10.2).
>
> You didn't read what I wrote above. ?Suppose A and B are _both_ part of
> the same descriptor, like hw_token and hw_qtd_next.

I think the case(keep writing order between parts in a same dma descriptor)
is only in constant dma-poll master case, just like ehci/uhci.

General case is that memory barrier is required before linking one dma
descriptor into hardware queue but after the dma descriptor is prepared.

>
>> > The memory barrier in your qh_link_async() example can make sure that
>> > the device always sees consistent data. ?It doesn't guarantee that the
>> > write to head->hw->hw_next will be flushed to memory in a reasonably
>> > short time, which is the problem you are trying to solve.
>>
>> Yes, up to now, it is the only case in which the flush can address to,
>> and in which kind of cases device will poll DMA coherent memory contiguously,
>> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
>
> Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.

In fact, the flush may be not required for ohci and xhci case, since there is
already one mmio register writing at the end of .enqueue path in ohci/xhci
driver.(just a glance at the code of ohci/xhci, please correct if I am wrong)

For UHCI, looks like it has not been used on ARM, so maybe can ignore it.
UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush
can produce a obvious performance boost.

So looks like the flush only makes sense on EHCI.

If the above is not wrong, is it really needed to introduce a general DMA API
only for EHCI?


thanks,
--
Ming Lei

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 0/3] RFC: addition to DMA API
  2011-09-02  0:59               ` Ming Lei
@ 2011-09-02 13:53                 ` Alan Stern
  -1 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-02 13:53 UTC (permalink / raw)
  To: Ming Lei; +Cc: linux-kernel, linux-arm-kernel, Mark Salter

On Fri, 2 Sep 2011, Ming Lei wrote:

> >> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
> >
> > Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.
> 
> In fact, the flush may be not required for ohci and xhci case, since there is
> already one mmio register writing at the end of .enqueue path in ohci/xhci
> driver.(just a glance at the code of ohci/xhci, please correct if I am wrong)

I don't know about xhci, but you're right about ohci.  However, there's 
no guarantee that the mmio write will always remain there; somebody 
might change the code so that the write takes place only when it is 
needed instead of every time.

> For UHCI, looks like it has not been used on ARM, so maybe can ignore it.
> UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush
> can produce a obvious performance boost.

Believe me, even with USB-1.1 a 20-ms delay will be noticeable.

> So looks like the flush only makes sense on EHCI.

Assuming ARM is the only architecture that needs it.

> If the above is not wrong, is it really needed to introduce a general DMA API
> only for EHCI?

Maybe not.  I'm hoping that people will identify the underlying cause
for these delayed write-backs and fix it.  Then no changes at all would
be needed in the USB stack.

Alan Stern


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 0/3] RFC: addition to DMA API
@ 2011-09-02 13:53                 ` Alan Stern
  0 siblings, 0 replies; 64+ messages in thread
From: Alan Stern @ 2011-09-02 13:53 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, 2 Sep 2011, Ming Lei wrote:

> >> I am not sure if there are other devices except for EHCI(maybe have uhci/ohci).
> >
> > Yes: UHCI, OHCI, EHCI, and XHCI all poll memory constantly.
> 
> In fact, the flush may be not required for ohci and xhci case, since there is
> already one mmio register writing at the end of .enqueue path in ohci/xhci
> driver.(just a glance at the code of ohci/xhci, please correct if I am wrong)

I don't know about xhci, but you're right about ohci.  However, there's 
no guarantee that the mmio write will always remain there; somebody 
might change the code so that the write takes place only when it is 
needed instead of every time.

> For UHCI, looks like it has not been used on ARM, so maybe can ignore it.
> UHCI is to support a slow usb 1.1 transfer, so I am wondering if the flush
> can produce a obvious performance boost.

Believe me, even with USB-1.1 a 20-ms delay will be noticeable.

> So looks like the flush only makes sense on EHCI.

Assuming ARM is the only architecture that needs it.

> If the above is not wrong, is it really needed to introduce a general DMA API
> only for EHCI?

Maybe not.  I'm hoping that people will identify the underlying cause
for these delayed write-backs and fix it.  Then no changes at all would
be needed in the USB stack.

Alan Stern

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 1/3] add dma_coherent_write_sync to DMA API
  2011-09-01 12:36       ` Mark Salter
@ 2011-09-06 14:30         ` Catalin Marinas
  -1 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:30 UTC (permalink / raw)
  To: Mark Salter
  Cc: Michał Mirosław, ming.lei, stern, linux-kernel,
	linux-arm-kernel

(coming late to this thread due to holidays)

2011/9/1 Mark Salter <msalter@redhat.com>:
> On Thu, 2011-09-01 at 11:57 +0200, Michał Mirosław wrote:
>> BTW, if there's no time limit on write buffers flushing, or if write
>> buffers can cause reordering of the writes, then the memory accesses
>> need to be managed just like non-DMA-coherent memory. So what differs
>> then in DMA-coherent vs non-DMA-coherent mappings then?
>
> My understanding is that ordering is preserved, but an ARM guy should
> probably verify that.

On ARMv6 onwards the coherent DMA is Normal Non-cacheable memory and
this is buffered and can be reordered.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 1/3] add dma_coherent_write_sync to DMA API
@ 2011-09-06 14:30         ` Catalin Marinas
  0 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:30 UTC (permalink / raw)
  To: linux-arm-kernel

(coming late to this thread due to holidays)

2011/9/1 Mark Salter <msalter@redhat.com>:
> On Thu, 2011-09-01 at 11:57 +0200, Micha? Miros?aw wrote:
>> BTW, if there's no time limit on write buffers flushing, or if write
>> buffers can cause reordering of the writes, then the memory accesses
>> need to be managed just like non-DMA-coherent memory. So what differs
>> then in DMA-coherent vs non-DMA-coherent mappings then?
>
> My understanding is that ordering is preserved, but an ARM guy should
> probably verify that.

On ARMv6 onwards the coherent DMA is Normal Non-cacheable memory and
this is buffered and can be reordered.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-08-31 21:30   ` Mark Salter
@ 2011-09-06 14:32     ` Catalin Marinas
  -1 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:32 UTC (permalink / raw)
  To: Mark Salter; +Cc: linux-kernel, linux-arm-kernel, ming.lei, stern

On 31 August 2011 22:30, Mark Salter <msalter@redhat.com> wrote:
> For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM
> specific dma_coherent_write_sync() to override the default version. This
> routine forces out any data sitting in a write buffer between the CPU and
> memory.
>
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
>  arch/arm/include/asm/dma-mapping.h |   10 ++++++++++
>  1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
> index 7a21d0b..e99562b 100644
> --- a/arch/arm/include/asm/dma-mapping.h
> +++ b/arch/arm/include/asm/dma-mapping.h
> @@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *,
>                void *, dma_addr_t, size_t);
>
>
> +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> +#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC
> +
> +static inline void dma_coherent_write_sync(void)
> +{
> +       dsb();
> +       outer_sync();
> +}

That's what mb() and wmb() do already, at least on ARM. Why do we need
another API? IIRC from past discussions on linux-arch around barriers,
the mb() should be sufficient in the case of DMA coherent buffers.
That's why macros like writel() on ARM have the mb() added by default
(for cases where you start the DMA transfer by writing to a device
register).

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-09-06 14:32     ` Catalin Marinas
  0 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:32 UTC (permalink / raw)
  To: linux-arm-kernel

On 31 August 2011 22:30, Mark Salter <msalter@redhat.com> wrote:
> For ARM kernels using CONFIG_ARM_DMA_MEM_BUFFERABLE, this patch adds an ARM
> specific dma_coherent_write_sync() to override the default version. This
> routine forces out any data sitting in a write buffer between the CPU and
> memory.
>
> Signed-off-by: Mark Salter <msalter@redhat.com>
> ---
> ?arch/arm/include/asm/dma-mapping.h | ? 10 ++++++++++
> ?1 files changed, 10 insertions(+), 0 deletions(-)
>
> diff --git a/arch/arm/include/asm/dma-mapping.h b/arch/arm/include/asm/dma-mapping.h
> index 7a21d0b..e99562b 100644
> --- a/arch/arm/include/asm/dma-mapping.h
> +++ b/arch/arm/include/asm/dma-mapping.h
> @@ -206,6 +206,16 @@ int dma_mmap_writecombine(struct device *, struct vm_area_struct *,
> ? ? ? ? ? ? ? ?void *, dma_addr_t, size_t);
>
>
> +#ifdef CONFIG_ARM_DMA_MEM_BUFFERABLE
> +#define ARCH_HAS_DMA_COHERENT_WRITE_SYNC
> +
> +static inline void dma_coherent_write_sync(void)
> +{
> + ? ? ? dsb();
> + ? ? ? outer_sync();
> +}

That's what mb() and wmb() do already, at least on ARM. Why do we need
another API? IIRC from past discussions on linux-arch around barriers,
the mb() should be sufficient in the case of DMA coherent buffers.
That's why macros like writel() on ARM have the mb() added by default
(for cases where you start the DMA transfer by writing to a device
register).

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-09-06 14:32     ` Catalin Marinas
@ 2011-09-06 14:37       ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-06 14:37 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: linux-kernel, linux-arm-kernel, ming.lei, stern

On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
> That's what mb() and wmb() do already, at least on ARM. Why do we need
> another API? IIRC from past discussions on linux-arch around barriers,
> the mb() should be sufficient in the case of DMA coherent buffers.
> That's why macros like writel() on ARM have the mb() added by default
> (for cases where you start the DMA transfer by writing to a device
> register). 

For USB EHCI, the driver does not necessarily write to a register after
writing to DMA coherent memory. In some cases, the controller polls for
information written by the driver.

--Mark



^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-09-06 14:37       ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-06 14:37 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
> That's what mb() and wmb() do already, at least on ARM. Why do we need
> another API? IIRC from past discussions on linux-arch around barriers,
> the mb() should be sufficient in the case of DMA coherent buffers.
> That's why macros like writel() on ARM have the mb() added by default
> (for cases where you start the DMA transfer by writing to a device
> register). 

For USB EHCI, the driver does not necessarily write to a register after
writing to DMA coherent memory. In some cases, the controller polls for
information written by the driver.

--Mark

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-09-06 14:37       ` Mark Salter
@ 2011-09-06 14:48         ` Catalin Marinas
  -1 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:48 UTC (permalink / raw)
  To: Mark Salter; +Cc: ming.lei, stern, linux-kernel, linux-arm-kernel

On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote:
> On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
>> That's what mb() and wmb() do already, at least on ARM. Why do we need
>> another API? IIRC from past discussions on linux-arch around barriers,
>> the mb() should be sufficient in the case of DMA coherent buffers.
>> That's why macros like writel() on ARM have the mb() added by default
>> (for cases where you start the DMA transfer by writing to a device
>> register).
>
> For USB EHCI, the driver does not necessarily write to a register after
> writing to DMA coherent memory. In some cases, the controller polls for
> information written by the driver.

So as I understand, you would like to force the eviction from the
write buffer rather than waiting for it to be drained. On ARM, the
write buffer is eventually flushed, so there is no strict timing
guarantee. It could take longer if the processor immediately starts
polling some memory location for example, but in this case a simple
barrier would do.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-09-06 14:48         ` Catalin Marinas
  0 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-09-06 14:48 UTC (permalink / raw)
  To: linux-arm-kernel

On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote:
> On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
>> That's what mb() and wmb() do already, at least on ARM. Why do we need
>> another API? IIRC from past discussions on linux-arch around barriers,
>> the mb() should be sufficient in the case of DMA coherent buffers.
>> That's why macros like writel() on ARM have the mb() added by default
>> (for cases where you start the DMA transfer by writing to a device
>> register).
>
> For USB EHCI, the driver does not necessarily write to a register after
> writing to DMA coherent memory. In some cases, the controller polls for
> information written by the driver.

So as I understand, you would like to force the eviction from the
write buffer rather than waiting for it to be drained. On ARM, the
write buffer is eventually flushed, so there is no strict timing
guarantee. It could take longer if the processor immediately starts
polling some memory location for example, but in this case a simple
barrier would do.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-09-06 14:48         ` Catalin Marinas
@ 2011-09-06 15:02           ` Mark Salter
  -1 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-06 15:02 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: ming.lei, stern, linux-kernel, linux-arm-kernel

On Tue, 2011-09-06 at 15:48 +0100, Catalin Marinas wrote:
> On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote:
> > On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
> >> That's what mb() and wmb() do already, at least on ARM. Why do we need
> >> another API? IIRC from past discussions on linux-arch around barriers,
> >> the mb() should be sufficient in the case of DMA coherent buffers.
> >> That's why macros like writel() on ARM have the mb() added by default
> >> (for cases where you start the DMA transfer by writing to a device
> >> register).
> >
> > For USB EHCI, the driver does not necessarily write to a register after
> > writing to DMA coherent memory. In some cases, the controller polls for
> > information written by the driver.
> 
> So as I understand, you would like to force the eviction from the
> write buffer rather than waiting for it to be drained. On ARM, the
> write buffer is eventually flushed, so there is no strict timing
> guarantee. It could take longer if the processor immediately starts
> polling some memory location for example, but in this case a simple
> barrier would do.

Yes, a memory barrier would have the same effect on ARM, but the
purpose of a barrier is to guarantee ordering. What the patch does
is add an interface to force a write buffer flush for performance,
not ordering. If a memory barrier is used, it could have a negative
impact on other arches.

In any case, the current thinking is that the original problem with
the USB performance seen on cortex A9 multicore is probably something
more than just write buffer delays. Once the original problem is better
understood, we can take another look at this patch if it is still
needed.

--Mark




^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-09-06 15:02           ` Mark Salter
  0 siblings, 0 replies; 64+ messages in thread
From: Mark Salter @ 2011-09-06 15:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 2011-09-06 at 15:48 +0100, Catalin Marinas wrote:
> On 6 September 2011 15:37, Mark Salter <msalter@redhat.com> wrote:
> > On Tue, 2011-09-06 at 15:32 +0100, Catalin Marinas wrote:
> >> That's what mb() and wmb() do already, at least on ARM. Why do we need
> >> another API? IIRC from past discussions on linux-arch around barriers,
> >> the mb() should be sufficient in the case of DMA coherent buffers.
> >> That's why macros like writel() on ARM have the mb() added by default
> >> (for cases where you start the DMA transfer by writing to a device
> >> register).
> >
> > For USB EHCI, the driver does not necessarily write to a register after
> > writing to DMA coherent memory. In some cases, the controller polls for
> > information written by the driver.
> 
> So as I understand, you would like to force the eviction from the
> write buffer rather than waiting for it to be drained. On ARM, the
> write buffer is eventually flushed, so there is no strict timing
> guarantee. It could take longer if the processor immediately starts
> polling some memory location for example, but in this case a simple
> barrier would do.

Yes, a memory barrier would have the same effect on ARM, but the
purpose of a barrier is to guarantee ordering. What the patch does
is add an interface to force a write buffer flush for performance,
not ordering. If a memory barrier is used, it could have a negative
impact on other arches.

In any case, the current thinking is that the original problem with
the USB performance seen on cortex A9 multicore is probably something
more than just write buffer delays. Once the original problem is better
understood, we can take another look at this patch if it is still
needed.

--Mark

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-09-06 15:02           ` Mark Salter
@ 2011-10-03  1:40             ` Jon Masters
  -1 siblings, 0 replies; 64+ messages in thread
From: Jon Masters @ 2011-10-03  1:40 UTC (permalink / raw)
  To: Mark Salter
  Cc: Catalin Marinas, ming.lei, stern, linux-kernel, linux-arm-kernel

On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
> 
> In any case, the current thinking is that the original problem with
> the USB performance seen on cortex A9 multicore is probably something
> more than just write buffer delays. Once the original problem is better
> understood, we can take another look at this patch if it is still
> needed.


Thanks again for looking into this Mark. My understanding is that this is still being investigated. I'll followup with ARM to see how that's going since I've heard nothing recently :) Meanwhile, we're continuing to carry a hack based on these patches in Fedora ARM kernels.

Jon.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-10-03  1:40             ` Jon Masters
  0 siblings, 0 replies; 64+ messages in thread
From: Jon Masters @ 2011-10-03  1:40 UTC (permalink / raw)
  To: linux-arm-kernel

On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
> 
> In any case, the current thinking is that the original problem with
> the USB performance seen on cortex A9 multicore is probably something
> more than just write buffer delays. Once the original problem is better
> understood, we can take another look at this patch if it is still
> needed.


Thanks again for looking into this Mark. My understanding is that this is still being investigated. I'll followup with ARM to see how that's going since I've heard nothing recently :) Meanwhile, we're continuing to carry a hack based on these patches in Fedora ARM kernels.

Jon.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-10-03  1:40             ` Jon Masters
@ 2011-10-03  8:44               ` Catalin Marinas
  -1 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-10-03  8:44 UTC (permalink / raw)
  To: Jon Masters; +Cc: Mark Salter, ming.lei, stern, linux-kernel, linux-arm-kernel

On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote:
> On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
> > In any case, the current thinking is that the original problem with
> > the USB performance seen on cortex A9 multicore is probably something
> > more than just write buffer delays. Once the original problem is better
> > understood, we can take another look at this patch if it is still
> > needed.
> 
> Thanks again for looking into this Mark. My understanding is that this
> is still being investigated. I'll followup with ARM to see how that's
> going since I've heard nothing recently :) Meanwhile, we're continuing
> to carry a hack based on these patches in Fedora ARM kernels.

Not talking about hardware specifics here, the architecture (though
ARMv7 onwards) mandates that the write buffer is eventually drained. But
doesn't state any upper limit, so it could even be half a second.

In this case, some form of buffer draining for devices that poll the
memory may be useful. If we don't want to add a new API, something like
below would work as well:

	write to DMA buffer;
	mb();
	read from DMA buffer;

So an mb() between the last write and a subsequent read would force the
visibility of the write. We have similar scenarios for Device accesses.

If we go for a DMA API extension, the dma_coherent_write_sync() should
probably take an address as well, just in case implementations would
force the draining via some read back from the same area.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-10-03  8:44               ` Catalin Marinas
  0 siblings, 0 replies; 64+ messages in thread
From: Catalin Marinas @ 2011-10-03  8:44 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote:
> On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
> > In any case, the current thinking is that the original problem with
> > the USB performance seen on cortex A9 multicore is probably something
> > more than just write buffer delays. Once the original problem is better
> > understood, we can take another look at this patch if it is still
> > needed.
> 
> Thanks again for looking into this Mark. My understanding is that this
> is still being investigated. I'll followup with ARM to see how that's
> going since I've heard nothing recently :) Meanwhile, we're continuing
> to carry a hack based on these patches in Fedora ARM kernels.

Not talking about hardware specifics here, the architecture (though
ARMv7 onwards) mandates that the write buffer is eventually drained. But
doesn't state any upper limit, so it could even be half a second.

In this case, some form of buffer draining for devices that poll the
memory may be useful. If we don't want to add a new API, something like
below would work as well:

	write to DMA buffer;
	mb();
	read from DMA buffer;

So an mb() between the last write and a subsequent read would force the
visibility of the write. We have similar scenarios for Device accesses.

If we go for a DMA API extension, the dma_coherent_write_sync() should
probably take an address as well, just in case implementations would
force the draining via some read back from the same area.

-- 
Catalin

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH 2/3] define ARM-specific dma_coherent_write_sync
  2011-10-03  8:44               ` Catalin Marinas
@ 2011-10-03  9:24                 ` Jon Masters
  -1 siblings, 0 replies; 64+ messages in thread
From: Jon Masters @ 2011-10-03  9:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Mark Salter, ming.lei, stern, linux-kernel, linux-arm-kernel


On Oct 3, 2011, at 4:44 AM, Catalin Marinas wrote:

> On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote:
>> On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
>>> In any case, the current thinking is that the original problem with
>>> the USB performance seen on cortex A9 multicore is probably something
>>> more than just write buffer delays. Once the original problem is better
>>> understood, we can take another look at this patch if it is still
>>> needed.
>> 
>> Thanks again for looking into this Mark. My understanding is that this
>> is still being investigated. I'll followup with ARM to see how that's
>> going since I've heard nothing recently :) Meanwhile, we're continuing
>> to carry a hack based on these patches in Fedora ARM kernels.
> 
> Not talking about hardware specifics here, the architecture (though
> ARMv7 onwards) mandates that the write buffer is eventually drained. But
> doesn't state any upper limit, so it could even be half a second.

<snip mechanics of dma_coherent_write_sync, etc.>

I guess my main question is, do you think this is just a write buffer delay? If you do, then we should definitely get back to this question of defining DMA extensions. But are we sure that's what this is?

Mark: did you have any more insight into this recently?

Jon.


^ permalink raw reply	[flat|nested] 64+ messages in thread

* [PATCH 2/3] define ARM-specific dma_coherent_write_sync
@ 2011-10-03  9:24                 ` Jon Masters
  0 siblings, 0 replies; 64+ messages in thread
From: Jon Masters @ 2011-10-03  9:24 UTC (permalink / raw)
  To: linux-arm-kernel


On Oct 3, 2011, at 4:44 AM, Catalin Marinas wrote:

> On Mon, Oct 03, 2011 at 02:40:19AM +0100, Jon Masters wrote:
>> On Sep 6, 2011, at 11:02 AM, Mark Salter wrote:
>>> In any case, the current thinking is that the original problem with
>>> the USB performance seen on cortex A9 multicore is probably something
>>> more than just write buffer delays. Once the original problem is better
>>> understood, we can take another look at this patch if it is still
>>> needed.
>> 
>> Thanks again for looking into this Mark. My understanding is that this
>> is still being investigated. I'll followup with ARM to see how that's
>> going since I've heard nothing recently :) Meanwhile, we're continuing
>> to carry a hack based on these patches in Fedora ARM kernels.
> 
> Not talking about hardware specifics here, the architecture (though
> ARMv7 onwards) mandates that the write buffer is eventually drained. But
> doesn't state any upper limit, so it could even be half a second.

<snip mechanics of dma_coherent_write_sync, etc.>

I guess my main question is, do you think this is just a write buffer delay? If you do, then we should definitely get back to this question of defining DMA extensions. But are we sure that's what this is?

Mark: did you have any more insight into this recently?

Jon.

^ permalink raw reply	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2011-10-03  9:24 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-08-31 21:30 [PATCH 0/3] RFC: addition to DMA API Mark Salter
2011-08-31 21:30 ` Mark Salter
2011-08-31 21:30 ` [PATCH 1/3] add dma_coherent_write_sync " Mark Salter
2011-08-31 21:30   ` Mark Salter
2011-09-01  2:59   ` Josh Cartwright
2011-09-01  2:59     ` Josh Cartwright
2011-09-01  9:57   ` Michał Mirosław
2011-09-01  9:57     ` Michał Mirosław
2011-09-01 12:36     ` Mark Salter
2011-09-01 12:36       ` Mark Salter
2011-09-06 14:30       ` Catalin Marinas
2011-09-06 14:30         ` Catalin Marinas
2011-08-31 21:30 ` [PATCH 2/3] define ARM-specific dma_coherent_write_sync Mark Salter
2011-08-31 21:30   ` Mark Salter
2011-09-06 14:32   ` Catalin Marinas
2011-09-06 14:32     ` Catalin Marinas
2011-09-06 14:37     ` Mark Salter
2011-09-06 14:37       ` Mark Salter
2011-09-06 14:48       ` Catalin Marinas
2011-09-06 14:48         ` Catalin Marinas
2011-09-06 15:02         ` Mark Salter
2011-09-06 15:02           ` Mark Salter
2011-10-03  1:40           ` Jon Masters
2011-10-03  1:40             ` Jon Masters
2011-10-03  8:44             ` Catalin Marinas
2011-10-03  8:44               ` Catalin Marinas
2011-10-03  9:24               ` Jon Masters
2011-10-03  9:24                 ` Jon Masters
2011-08-31 21:30 ` [PATCH 3/3] add dma_coherent_write_sync calls to USB EHCI driver Mark Salter
2011-08-31 21:30   ` Mark Salter
2011-09-01  2:33   ` Ming Lei
2011-09-01  2:33     ` Ming Lei
2011-09-01  2:09 ` [PATCH 0/3] RFC: addition to DMA API Ming Lei
2011-09-01  2:09   ` Ming Lei
2011-09-01  3:09   ` Alan Stern
2011-09-01  3:09     ` Alan Stern
2011-09-01  3:41     ` Ming Lei
2011-09-01  3:41       ` Ming Lei
2011-09-01  8:45       ` Will Deacon
2011-09-01  8:45         ` Will Deacon
2011-09-01  9:14         ` Ming Lei
2011-09-01  9:14           ` Ming Lei
2011-09-01 15:42           ` Alan Stern
2011-09-01 15:42             ` Alan Stern
2011-09-01 16:04             ` Russell King - ARM Linux
2011-09-01 16:04               ` Russell King - ARM Linux
2011-09-01 17:31               ` Will Deacon
2011-09-01 17:31                 ` Will Deacon
2011-09-01 18:07                 ` Russell King - ARM Linux
2011-09-01 18:07                   ` Russell King - ARM Linux
2011-09-01 19:14                 ` Mark Salter
2011-09-01 19:14                   ` Mark Salter
2011-09-01 15:22       ` Alan Stern
2011-09-01 15:22         ` Alan Stern
2011-09-01 15:56         ` Ming Lei
2011-09-01 15:56           ` Ming Lei
2011-09-01 16:48           ` Alan Stern
2011-09-01 16:48             ` Alan Stern
2011-09-02  0:59             ` Ming Lei
2011-09-02  0:59               ` Ming Lei
2011-09-02 13:53               ` Alan Stern
2011-09-02 13:53                 ` Alan Stern
2011-09-01  9:11 ` Will Deacon
2011-09-01  9:11   ` Will Deacon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.