All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
@ 2018-06-17 14:36 Matwey V. Kornilov
  2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov
  2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt
  0 siblings, 2 replies; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-06-17 14:36 UTC (permalink / raw)
  To: hverkuil, mchehab
  Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media,
	linux-kernel, Matwey V. Kornilov

Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
---
 drivers/media/usb/pwc/pwc-if.c |  7 +++++++
 include/trace/events/pwc.h     | 45 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)
 create mode 100644 include/trace/events/pwc.h

diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
index 54b036d39c5b..5775d1f60668 100644
--- a/drivers/media/usb/pwc/pwc-if.c
+++ b/drivers/media/usb/pwc/pwc-if.c
@@ -57,6 +57,9 @@
    - Pham Thanh Nam: webcam snapshot button as an event input device
 */
 
+#define CREATE_TRACE_POINTS
+#include <trace/events/pwc.h>
+
 #include <linux/errno.h>
 #include <linux/init.h>
 #include <linux/mm.h>
@@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb)
 	int i, fst, flen;
 	unsigned char *iso_buf = NULL;
 
+	trace_pwc_handler_enter(urb);
+
 	if (urb->status == -ENOENT || urb->status == -ECONNRESET ||
 	    urb->status == -ESHUTDOWN) {
 		PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n",
@@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb)
 		pdev->vlast_packet_size = flen;
 	}
 
+	trace_pwc_handler_exit(urb);
+
 handler_end:
 	i = usb_submit_urb(urb, GFP_ATOMIC);
 	if (i != 0)
diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h
new file mode 100644
index 000000000000..b13d2118bb7a
--- /dev/null
+++ b/include/trace/events/pwc.h
@@ -0,0 +1,45 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_PWC_H
+
+#include <linux/usb.h>
+#include <linux/tracepoint.h>
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM pwc
+
+TRACE_EVENT(pwc_handler_enter,
+	TP_PROTO(struct urb *urb),
+	TP_ARGS(urb),
+	TP_STRUCT__entry(
+		__field(struct urb*, urb)
+		__field(int, urb__status)
+		__field(u32, urb__actual_length)
+	),
+	TP_fast_assign(
+		__entry->urb = urb;
+		__entry->urb__status = urb->status;
+		__entry->urb__actual_length = urb->actual_length;
+	),
+	TP_printk("urb=%p (status=%d actual_length=%u)",
+		__entry->urb,
+		__entry->urb__status,
+		__entry->urb__actual_length)
+);
+
+TRACE_EVENT(pwc_handler_exit,
+	TP_PROTO(struct urb *urb),
+	TP_ARGS(urb),
+	TP_STRUCT__entry(
+		__field(struct urb*, urb)
+	),
+	TP_fast_assign(
+		__entry->urb = urb;
+	),
+	TP_printk("urb=%p", __entry->urb)
+);
+
+#endif /* _TRACE_PWC_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
-- 
2.16.0-rc1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov
@ 2018-06-17 14:36 ` Matwey V. Kornilov
  2018-06-18  5:11   ` Ezequiel Garcia
  2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt
  1 sibling, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-06-17 14:36 UTC (permalink / raw)
  To: hverkuil, mchehab
  Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media,
	linux-kernel, Matwey V. Kornilov

DMA cocherency slows the transfer down on systems without hardware
coherent DMA.

Based on previous commit the following performance benchmarks have been
carried out. Average memcpy() data transfer rate (rate) and handler
completion time (time) have been measured when running video stream at
640x480 resolution at 10fps.

x86_64 based system (Intel Core i5-3470). This platform has hardware
coherent DMA support and proposed change doesn't make big difference here.

 * kmalloc:            rate = (4.4 +- 1.0) GBps
                       time = (2.4 +- 1.2) usec
 * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
                       time = (2.5 +- 1.0) usec

We see that the measurements agree well within error ranges in this case.
So no performance downgrade is introduced.

armv7l based system (TI AM335x BeagleBone Black). This platform has no
hardware coherent DMA support. DMA coherence is implemented via disabled
page caching that slows down memcpy() due to memory controller behaviour.

 * kmalloc:            rate =  (190 +-  30) MBps
                       time =   (50 +-  10) usec
 * usb_alloc_coherent: rate =   (33 +-   4) MBps
                       time = (3000 +- 400) usec

Note, that quantative difference leads (this commit leads to 5 times
acceleration) to qualitative behavior change in this case. As it was
stated before, the video stream can not be successfully received at AM335x
platforms with MUSB based USB host controller due to performance issues
[1].

[1] https://www.spinics.net/lists/linux-usb/msg165735.html

Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
---
 drivers/media/usb/pwc/pwc-if.c | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
index 5775d1f60668..6a3cd9680a7f 100644
--- a/drivers/media/usb/pwc/pwc-if.c
+++ b/drivers/media/usb/pwc/pwc-if.c
@@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev)
 		urb->interval = 1; // devik
 		urb->dev = udev;
 		urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint);
-		urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
-		urb->transfer_buffer = usb_alloc_coherent(udev,
-							  ISO_BUFFER_SIZE,
-							  GFP_KERNEL,
-							  &urb->transfer_dma);
+		urb->transfer_flags = URB_ISO_ASAP;
+		urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL);
 		if (urb->transfer_buffer == NULL) {
 			PWC_ERROR("Failed to allocate urb buffer %d\n", i);
 			pwc_isoc_cleanup(pdev);
@@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev)
 		if (pdev->urbs[i]) {
 			PWC_DEBUG_MEMORY("Freeing URB\n");
 			if (pdev->urbs[i]->transfer_buffer) {
-				usb_free_coherent(pdev->udev,
-					pdev->urbs[i]->transfer_buffer_length,
-					pdev->urbs[i]->transfer_buffer,
-					pdev->urbs[i]->transfer_dma);
+				kfree(pdev->urbs[i]->transfer_buffer);
 			}
 			usb_free_urb(pdev->urbs[i]);
 			pdev->urbs[i] = NULL;
-- 
2.16.0-rc1


^ permalink raw reply related	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov
@ 2018-06-18  5:11   ` Ezequiel Garcia
  2018-06-18  7:10     ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Ezequiel Garcia @ 2018-06-18  5:11 UTC (permalink / raw)
  To: Matwey V. Kornilov, hverkuil, mchehab, Laurent Pinchart
  Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media, linux-kernel

+ Laurent

On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
> DMA cocherency slows the transfer down on systems without hardware
> coherent DMA.
> 
> Based on previous commit the following performance benchmarks have been
> carried out. Average memcpy() data transfer rate (rate) and handler
> completion time (time) have been measured when running video stream at
> 640x480 resolution at 10fps.
> 
> x86_64 based system (Intel Core i5-3470). This platform has hardware
> coherent DMA support and proposed change doesn't make big difference here.
> 
>  * kmalloc:            rate = (4.4 +- 1.0) GBps
>                        time = (2.4 +- 1.2) usec
>  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
>                        time = (2.5 +- 1.0) usec
> 
> We see that the measurements agree well within error ranges in this case.
> So no performance downgrade is introduced.
> 
> armv7l based system (TI AM335x BeagleBone Black). This platform has no
> hardware coherent DMA support. DMA coherence is implemented via disabled
> page caching that slows down memcpy() due to memory controller behaviour.
> 
>  * kmalloc:            rate =  (190 +-  30) MBps
>                        time =   (50 +-  10) usec
>  * usb_alloc_coherent: rate =   (33 +-   4) MBps
>                        time = (3000 +- 400) usec
> 
> Note, that quantative difference leads (this commit leads to 5 times
> acceleration) to qualitative behavior change in this case. As it was
> stated before, the video stream can not be successfully received at AM335x
> platforms with MUSB based USB host controller due to performance issues
> [1].
> 
> [1] https://www.spinics.net/lists/linux-usb/msg165735.html
> 

This is quite interesting! I have receive similar complaints
from users wanting to use stk1160 on BBB and Raspberrys,
without much luck on either, due to insufficient isoc bandwidth.

I'm guessing other ARM platforms could be suffering
from the same issue.

Note that stk1160 and uvcvideo drivers use kmalloc on platforms
where DMA_NONCOHERENT is defined, but this is not the case
on ARM platforms.

So, what is the benefit of using consistent 
for these URBs, as opposed to streaming?

If the choice is simply platform dependent,
can't we somehow detect which mapping should
be prefered?

> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
> ---
>  drivers/media/usb/pwc/pwc-if.c | 12 +++---------
>  1 file changed, 3 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
> index 5775d1f60668..6a3cd9680a7f 100644
> --- a/drivers/media/usb/pwc/pwc-if.c
> +++ b/drivers/media/usb/pwc/pwc-if.c
> @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev)
>  		urb->interval = 1; // devik
>  		urb->dev = udev;
>  		urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint);
> -		urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
> -		urb->transfer_buffer = usb_alloc_coherent(udev,
> -							  ISO_BUFFER_SIZE,
> -							  GFP_KERNEL,
> -							  &urb->transfer_dma);
> +		urb->transfer_flags = URB_ISO_ASAP;
> +		urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL);
>  		if (urb->transfer_buffer == NULL) {
>  			PWC_ERROR("Failed to allocate urb buffer %d\n", i);
>  			pwc_isoc_cleanup(pdev);
> @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev)
>  		if (pdev->urbs[i]) {
>  			PWC_DEBUG_MEMORY("Freeing URB\n");
>  			if (pdev->urbs[i]->transfer_buffer) {
> -				usb_free_coherent(pdev->udev,
> -					pdev->urbs[i]->transfer_buffer_length,
> -					pdev->urbs[i]->transfer_buffer,
> -					pdev->urbs[i]->transfer_dma);
> +				kfree(pdev->urbs[i]->transfer_buffer);
>  			}
>  			usb_free_urb(pdev->urbs[i]);
>  			pdev->urbs[i] = NULL;

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-06-18  5:11   ` Ezequiel Garcia
@ 2018-06-18  7:10     ` Matwey V. Kornilov
  2018-07-17 20:10       ` Ezequiel Garcia
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-06-18  7:10 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: hverkuil, mchehab, Laurent Pinchart, rostedt, mingo, isely,
	bhumirks, colin.king, linux-media, open list

Hi Ezequiel,

2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> + Laurent
>
> On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
>> DMA cocherency slows the transfer down on systems without hardware
>> coherent DMA.
>>
>> Based on previous commit the following performance benchmarks have been
>> carried out. Average memcpy() data transfer rate (rate) and handler
>> completion time (time) have been measured when running video stream at
>> 640x480 resolution at 10fps.
>>
>> x86_64 based system (Intel Core i5-3470). This platform has hardware
>> coherent DMA support and proposed change doesn't make big difference here.
>>
>>  * kmalloc:            rate = (4.4 +- 1.0) GBps
>>                        time = (2.4 +- 1.2) usec
>>  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
>>                        time = (2.5 +- 1.0) usec
>>
>> We see that the measurements agree well within error ranges in this case.
>> So no performance downgrade is introduced.
>>
>> armv7l based system (TI AM335x BeagleBone Black). This platform has no
>> hardware coherent DMA support. DMA coherence is implemented via disabled
>> page caching that slows down memcpy() due to memory controller behaviour.
>>
>>  * kmalloc:            rate =  (190 +-  30) MBps
>>                        time =   (50 +-  10) usec
>>  * usb_alloc_coherent: rate =   (33 +-   4) MBps
>>                        time = (3000 +- 400) usec
>>
>> Note, that quantative difference leads (this commit leads to 5 times
>> acceleration) to qualitative behavior change in this case. As it was
>> stated before, the video stream can not be successfully received at AM335x
>> platforms with MUSB based USB host controller due to performance issues
>> [1].
>>
>> [1] https://www.spinics.net/lists/linux-usb/msg165735.html
>>
>
> This is quite interesting! I have receive similar complaints
> from users wanting to use stk1160 on BBB and Raspberrys,
> without much luck on either, due to insufficient isoc bandwidth.
>
> I'm guessing other ARM platforms could be suffering
> from the same issue.
>
> Note that stk1160 and uvcvideo drivers use kmalloc on platforms
> where DMA_NONCOHERENT is defined, but this is not the case
> on ARM platforms.

There are some ARMv7 platforms that have coherent DMA (for instance
Broadcome Horthstar Plus series), but the most of them don't have. It
is defined in device tree file, and there is no way to recover this
information at runtime in USB perepherial driver.

>
> So, what is the benefit of using consistent
> for these URBs, as opposed to streaming?

I don't know, I think there is no real benefit and all we see is a
consequence of copy-pasta when some webcam drivers were inspired by
others and development priparily was going at x86 platforms. It would
be great if somebody corrected me here. DMA Coherence is quite strong
property and I cannot figure out how can it help when streaming video.
The CPU host always reads from the buffer and never writes to.
Hardware perepherial always writes to and never reads from. Moreover,
buffer access is mutually exclusive and separated in time by Interrupt
fireing and URB starting (when we reuse existing URB for new request).
Only single one memory barrier is really required here.

I understand that there are cases when DMA coherence is really needed,
for instane VirtIO VRing when we accessing same data structure in both
directions from the both sides, but this has nothing common with our
case.

>
> If the choice is simply platform dependent,
> can't we somehow detect which mapping should
> be prefered?

Now, we don't have this way.

>
>> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
>> ---
>>  drivers/media/usb/pwc/pwc-if.c | 12 +++---------
>>  1 file changed, 3 insertions(+), 9 deletions(-)
>>
>> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
>> index 5775d1f60668..6a3cd9680a7f 100644
>> --- a/drivers/media/usb/pwc/pwc-if.c
>> +++ b/drivers/media/usb/pwc/pwc-if.c
>> @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev)
>>               urb->interval = 1; // devik
>>               urb->dev = udev;
>>               urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint);
>> -             urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
>> -             urb->transfer_buffer = usb_alloc_coherent(udev,
>> -                                                       ISO_BUFFER_SIZE,
>> -                                                       GFP_KERNEL,
>> -                                                       &urb->transfer_dma);
>> +             urb->transfer_flags = URB_ISO_ASAP;
>> +             urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL);
>>               if (urb->transfer_buffer == NULL) {
>>                       PWC_ERROR("Failed to allocate urb buffer %d\n", i);
>>                       pwc_isoc_cleanup(pdev);
>> @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev)
>>               if (pdev->urbs[i]) {
>>                       PWC_DEBUG_MEMORY("Freeing URB\n");
>>                       if (pdev->urbs[i]->transfer_buffer) {
>> -                             usb_free_coherent(pdev->udev,
>> -                                     pdev->urbs[i]->transfer_buffer_length,
>> -                                     pdev->urbs[i]->transfer_buffer,
>> -                                     pdev->urbs[i]->transfer_dma);
>> +                             kfree(pdev->urbs[i]->transfer_buffer);
>>                       }
>>                       usb_free_urb(pdev->urbs[i]);
>>                       pdev->urbs[i] = NULL;
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
  2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov
  2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov
@ 2018-06-18 18:58 ` Steven Rostedt
  2018-06-19 16:23   ` Matwey V. Kornilov
  1 sibling, 1 reply; 38+ messages in thread
From: Steven Rostedt @ 2018-06-18 18:58 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: hverkuil, mchehab, mingo, isely, bhumirks, colin.king,
	linux-media, linux-kernel

On Sun, 17 Jun 2018 17:36:24 +0300
"Matwey V. Kornilov" <matwey@sai.msu.ru> wrote:

I would prefer a change log here that would explain why these
tracepoints are being added.


> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
> ---
>  drivers/media/usb/pwc/pwc-if.c |  7 +++++++
>  include/trace/events/pwc.h     | 45 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
>  create mode 100644 include/trace/events/pwc.h
> 
> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
> index 54b036d39c5b..5775d1f60668 100644
> --- a/drivers/media/usb/pwc/pwc-if.c
> +++ b/drivers/media/usb/pwc/pwc-if.c
> @@ -57,6 +57,9 @@
>     - Pham Thanh Nam: webcam snapshot button as an event input device
>  */
>  
> +#define CREATE_TRACE_POINTS
> +#include <trace/events/pwc.h>
> +
>  #include <linux/errno.h>
>  #include <linux/init.h>
>  #include <linux/mm.h>
> @@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb)
>  	int i, fst, flen;
>  	unsigned char *iso_buf = NULL;
>  
> +	trace_pwc_handler_enter(urb);
> +
>  	if (urb->status == -ENOENT || urb->status == -ECONNRESET ||
>  	    urb->status == -ESHUTDOWN) {
>  		PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n",

Looks like if this is hit, we will return from the function without
calling trace_pwc_handler_exit().

> @@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb)
>  		pdev->vlast_packet_size = flen;
>  	}
>  
> +	trace_pwc_handler_exit(urb);
> +
>  handler_end:

Why not add the tracepoint after handler_end. In fact, why not add some
exit status to the trace event? I would think that would be useful as
well.


>  	i = usb_submit_urb(urb, GFP_ATOMIC);
>  	if (i != 0)
> diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h
> new file mode 100644
> index 000000000000..b13d2118bb7a
> --- /dev/null
> +++ b/include/trace/events/pwc.h
> @@ -0,0 +1,45 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ)
> +#define _TRACE_PWC_H
> +
> +#include <linux/usb.h>
> +#include <linux/tracepoint.h>
> +
> +#undef TRACE_SYSTEM
> +#define TRACE_SYSTEM pwc
> +
> +TRACE_EVENT(pwc_handler_enter,
> +	TP_PROTO(struct urb *urb),
> +	TP_ARGS(urb),
> +	TP_STRUCT__entry(
> +		__field(struct urb*, urb)
> +		__field(int, urb__status)
> +		__field(u32, urb__actual_length)
> +	),
> +	TP_fast_assign(
> +		__entry->urb = urb;
> +		__entry->urb__status = urb->status;
> +		__entry->urb__actual_length = urb->actual_length;

Is there any other data that may be interesting to record here?

-- Steve

> +	),
> +	TP_printk("urb=%p (status=%d actual_length=%u)",
> +		__entry->urb,
> +		__entry->urb__status,
> +		__entry->urb__actual_length)
> +);
> +
> +TRACE_EVENT(pwc_handler_exit,
> +	TP_PROTO(struct urb *urb),
> +	TP_ARGS(urb),
> +	TP_STRUCT__entry(
> +		__field(struct urb*, urb)
> +	),
> +	TP_fast_assign(
> +		__entry->urb = urb;
> +	),
> +	TP_printk("urb=%p", __entry->urb)
> +);
> +
> +#endif /* _TRACE_PWC_H */
> +
> +/* This part must be outside protection */
> +#include <trace/define_trace.h>


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
  2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt
@ 2018-06-19 16:23   ` Matwey V. Kornilov
  2018-06-19 16:33     ` Steven Rostedt
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-06-19 16:23 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

Hi Steven,

Thank you for valuable comments.

This is for measuring performance of URB completion callback inside PWC driver.
What do you think about moving events to __usb_hcd_giveback_urb() in
order to make this more generic? Like the following:

        local_irq_save(flags);
// trace urb complete enter
        urb->complete(urb);
// trace urb complete exit
        local_irq_restore(flags);


2018-06-18 21:58 GMT+03:00 Steven Rostedt <rostedt@goodmis.org>:
> On Sun, 17 Jun 2018 17:36:24 +0300
> "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote:
>
> I would prefer a change log here that would explain why these
> tracepoints are being added.
>
>
>> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
>> ---
>>  drivers/media/usb/pwc/pwc-if.c |  7 +++++++
>>  include/trace/events/pwc.h     | 45 ++++++++++++++++++++++++++++++++++++++++++
>>  2 files changed, 52 insertions(+)
>>  create mode 100644 include/trace/events/pwc.h
>>
>> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
>> index 54b036d39c5b..5775d1f60668 100644
>> --- a/drivers/media/usb/pwc/pwc-if.c
>> +++ b/drivers/media/usb/pwc/pwc-if.c
>> @@ -57,6 +57,9 @@
>>     - Pham Thanh Nam: webcam snapshot button as an event input device
>>  */
>>
>> +#define CREATE_TRACE_POINTS
>> +#include <trace/events/pwc.h>
>> +
>>  #include <linux/errno.h>
>>  #include <linux/init.h>
>>  #include <linux/mm.h>
>> @@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb)
>>       int i, fst, flen;
>>       unsigned char *iso_buf = NULL;
>>
>> +     trace_pwc_handler_enter(urb);
>> +
>>       if (urb->status == -ENOENT || urb->status == -ECONNRESET ||
>>           urb->status == -ESHUTDOWN) {
>>               PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n",
>
> Looks like if this is hit, we will return from the function without
> calling trace_pwc_handler_exit().
>
>> @@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb)
>>               pdev->vlast_packet_size = flen;
>>       }
>>
>> +     trace_pwc_handler_exit(urb);
>> +
>>  handler_end:
>
> Why not add the tracepoint after handler_end. In fact, why not add some
> exit status to the trace event? I would think that would be useful as
> well.
>
>
>>       i = usb_submit_urb(urb, GFP_ATOMIC);
>>       if (i != 0)
>> diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h
>> new file mode 100644
>> index 000000000000..b13d2118bb7a
>> --- /dev/null
>> +++ b/include/trace/events/pwc.h
>> @@ -0,0 +1,45 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ)
>> +#define _TRACE_PWC_H
>> +
>> +#include <linux/usb.h>
>> +#include <linux/tracepoint.h>
>> +
>> +#undef TRACE_SYSTEM
>> +#define TRACE_SYSTEM pwc
>> +
>> +TRACE_EVENT(pwc_handler_enter,
>> +     TP_PROTO(struct urb *urb),
>> +     TP_ARGS(urb),
>> +     TP_STRUCT__entry(
>> +             __field(struct urb*, urb)
>> +             __field(int, urb__status)
>> +             __field(u32, urb__actual_length)
>> +     ),
>> +     TP_fast_assign(
>> +             __entry->urb = urb;
>> +             __entry->urb__status = urb->status;
>> +             __entry->urb__actual_length = urb->actual_length;
>
> Is there any other data that may be interesting to record here?
>
> -- Steve
>
>> +     ),
>> +     TP_printk("urb=%p (status=%d actual_length=%u)",
>> +             __entry->urb,
>> +             __entry->urb__status,
>> +             __entry->urb__actual_length)
>> +);
>> +
>> +TRACE_EVENT(pwc_handler_exit,
>> +     TP_PROTO(struct urb *urb),
>> +     TP_ARGS(urb),
>> +     TP_STRUCT__entry(
>> +             __field(struct urb*, urb)
>> +     ),
>> +     TP_fast_assign(
>> +             __entry->urb = urb;
>> +     ),
>> +     TP_printk("urb=%p", __entry->urb)
>> +);
>> +
>> +#endif /* _TRACE_PWC_H */
>> +
>> +/* This part must be outside protection */
>> +#include <trace/define_trace.h>
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
  2018-06-19 16:23   ` Matwey V. Kornilov
@ 2018-06-19 16:33     ` Steven Rostedt
  2018-06-20  8:05       ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Steven Rostedt @ 2018-06-19 16:33 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

On Tue, 19 Jun 2018 19:23:04 +0300
"Matwey V. Kornilov" <matwey@sai.msu.ru> wrote:

> Hi Steven,
> 
> Thank you for valuable comments.
> 
> This is for measuring performance of URB completion callback inside PWC driver.
> What do you think about moving events to __usb_hcd_giveback_urb() in
> order to make this more generic? Like the following:
> 
>         local_irq_save(flags);
> // trace urb complete enter
>         urb->complete(urb);
> // trace urb complete exit
>         local_irq_restore(flags);
> 
> 

If that can work for you, I'm fine with that. Trace events may be
cheap, but they do come with some cost. I'd like to have all trace
events be as valuable as possible, and limit the "special case" ones.

-- Steve

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
  2018-06-19 16:33     ` Steven Rostedt
@ 2018-06-20  8:05       ` Matwey V. Kornilov
  2018-06-20 13:09         ` Steven Rostedt
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-06-20  8:05 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

2018-06-19 19:33 GMT+03:00 Steven Rostedt <rostedt@goodmis.org>:
> On Tue, 19 Jun 2018 19:23:04 +0300
> "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote:
>
>> Hi Steven,
>>
>> Thank you for valuable comments.
>>
>> This is for measuring performance of URB completion callback inside PWC driver.
>> What do you think about moving events to __usb_hcd_giveback_urb() in
>> order to make this more generic? Like the following:
>>
>>         local_irq_save(flags);
>> // trace urb complete enter
>>         urb->complete(urb);
>> // trace urb complete exit
>>         local_irq_restore(flags);
>>
>>
>
> If that can work for you, I'm fine with that. Trace events may be
> cheap, but they do come with some cost. I'd like to have all trace
> events be as valuable as possible, and limit the "special case" ones.

What is the cost for events? I suppose one conditional check when
trace is disabled? There is already similar debugging stuff related to
usbmon in __usb_hcd_giveback_urb(), so I don't think that another
conditional check will hurt performance dramatically there. When
discussing second patch in this series I see that the issue that it is
intended to resolve may be common to other USB media drivers.


>
> -- Steve
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler()
  2018-06-20  8:05       ` Matwey V. Kornilov
@ 2018-06-20 13:09         ` Steven Rostedt
  0 siblings, 0 replies; 38+ messages in thread
From: Steven Rostedt @ 2018-06-20 13:09 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

On Wed, 20 Jun 2018 11:05:51 +0300
"Matwey V. Kornilov" <matwey@sai.msu.ru> wrote:


> > If that can work for you, I'm fine with that. Trace events may be
> > cheap, but they do come with some cost. I'd like to have all trace
> > events be as valuable as possible, and limit the "special case" ones.  
> 
> What is the cost for events? I suppose one conditional check when
> trace is disabled? There is already similar debugging stuff related to
> usbmon in __usb_hcd_giveback_urb(), so I don't think that another
> conditional check will hurt performance dramatically there. When
> discussing second patch in this series I see that the issue that it is
> intended to resolve may be common to other USB media drivers.

The cost isn't just about performance. In fact, the performance
overhead of trace events is pretty negligible. The cost I'm worried
about is bloat. Each event can take up to 5K of memory. That can add up
quickly when we add a bunch of events without thinking about that cost.

-- Steve


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-06-18  7:10     ` Matwey V. Kornilov
@ 2018-07-17 20:10       ` Ezequiel Garcia
  2018-07-17 20:51         ` Alan Stern
                           ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Ezequiel Garcia @ 2018-07-17 20:10 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart,
	rostedt, mingo, isely, bhumirks, colin.king, linux-media,
	open list

Hi Matwey,

First of all, sorry for the delay.

Adding Alan and Hans. Guys, do you have any feedback here?

See below for some feedback on my side.

On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote:
> Hi Ezequiel,
> 
> 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> > + Laurent
> > 
> > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
> > > DMA cocherency slows the transfer down on systems without hardware
> > > coherent DMA.
> > > 
> > > Based on previous commit the following performance benchmarks have been
> > > carried out. Average memcpy() data transfer rate (rate) and handler
> > > completion time (time) have been measured when running video stream at
> > > 640x480 resolution at 10fps.
> > > 
> > > x86_64 based system (Intel Core i5-3470). This platform has hardware
> > > coherent DMA support and proposed change doesn't make big difference here.
> > > 
> > >  * kmalloc:            rate = (4.4 +- 1.0) GBps
> > >                        time = (2.4 +- 1.2) usec
> > >  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
> > >                        time = (2.5 +- 1.0) usec
> > > 
> > > We see that the measurements agree well within error ranges in this case.
> > > So no performance downgrade is introduced.
> > > 
> > > armv7l based system (TI AM335x BeagleBone Black). This platform has no
> > > hardware coherent DMA support. DMA coherence is implemented via disabled
> > > page caching that slows down memcpy() due to memory controller behaviour.
> > > 
> > >  * kmalloc:            rate =  (190 +-  30) MBps
> > >                        time =   (50 +-  10) usec
> > >  * usb_alloc_coherent: rate =   (33 +-   4) MBps
> > >                        time = (3000 +- 400) usec
> > > 
> > > Note, that quantative difference leads (this commit leads to 5 times
> > > acceleration) to qualitative behavior change in this case. As it was
> > > stated before, the video stream can not be successfully received at AM335x
> > > platforms with MUSB based USB host controller due to performance issues
> > > [1].
> > > 
> > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html
> > > 
> > 
> > This is quite interesting! I have receive similar complaints
> > from users wanting to use stk1160 on BBB and Raspberrys,
> > without much luck on either, due to insufficient isoc bandwidth.
> > 
> > I'm guessing other ARM platforms could be suffering
> > from the same issue.
> > 
> > Note that stk1160 and uvcvideo drivers use kmalloc on platforms
> > where DMA_NONCOHERENT is defined, but this is not the case
> > on ARM platforms.
> 
> There are some ARMv7 platforms that have coherent DMA (for instance
> Broadcome Horthstar Plus series), but the most of them don't have. It
> is defined in device tree file, and there is no way to recover this
> information at runtime in USB perepherial driver.
> 
> > 
> > So, what is the benefit of using consistent
> > for these URBs, as opposed to streaming?
> 
> I don't know, I think there is no real benefit and all we see is a
> consequence of copy-pasta when some webcam drivers were inspired by
> others and development priparily was going at x86 platforms.

You are probably right about the copy-pasta.

>  It would
> be great if somebody corrected me here. DMA Coherence is quite strong
> property and I cannot figure out how can it help when streaming video.
> The CPU host always reads from the buffer and never writes to.
> Hardware perepherial always writes to and never reads from. Moreover,
> buffer access is mutually exclusive and separated in time by Interrupt
> fireing and URB starting (when we reuse existing URB for new request).
> Only single one memory barrier is really required here.
> 

Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
create DMA mappings and use the streaming API. Which makes more
sense in hardware without hardware coherency.

The only thing that bothers me with this patch is that it's not
really something specific to this driver. If this fix is valid
for pwc, then it's valid for all the drivers allocating coherent
memory.

And also, this path won't prevent further copy-paste spread
of the coherent allocation.

Is there any chance we can introduce a helper to allocate
isoc URBs, and then change all drivers to use it? No need
to do all of them now, but it would be good to at least have
a plan for it.

Chances are, only a handful of platforms would care to use
coherent mappings.

> I understand that there are cases when DMA coherence is really needed,
> for instane VirtIO VRing when we accessing same data structure in both
> directions from the both sides, but this has nothing common with our
> case.
> 
> > 
> > If the choice is simply platform dependent,
> > can't we somehow detect which mapping should
> > be prefered?
> 
> Now, we don't have this way.
> 
> > 
> > > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
> > > ---
> > >  drivers/media/usb/pwc/pwc-if.c | 12 +++---------
> > >  1 file changed, 3 insertions(+), 9 deletions(-)
> > > 
> > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
> > > index 5775d1f60668..6a3cd9680a7f 100644
> > > --- a/drivers/media/usb/pwc/pwc-if.c
> > > +++ b/drivers/media/usb/pwc/pwc-if.c
> > > @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev)
> > >               urb->interval = 1; // devik
> > >               urb->dev = udev;
> > >               urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint);
> > > -             urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
> > > -             urb->transfer_buffer = usb_alloc_coherent(udev,
> > > -                                                       ISO_BUFFER_SIZE,
> > > -                                                       GFP_KERNEL,
> > > -                                                       &urb->transfer_dma);
> > > +             urb->transfer_flags = URB_ISO_ASAP;
> > > +             urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL);
> > >               if (urb->transfer_buffer == NULL) {
> > >                       PWC_ERROR("Failed to allocate urb buffer %d\n", i);
> > >                       pwc_isoc_cleanup(pdev);
> > > @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev)
> > >               if (pdev->urbs[i]) {
> > >                       PWC_DEBUG_MEMORY("Freeing URB\n");
> > >                       if (pdev->urbs[i]->transfer_buffer) {
> > > -                             usb_free_coherent(pdev->udev,
> > > -                                     pdev->urbs[i]->transfer_buffer_length,
> > > -                                     pdev->urbs[i]->transfer_buffer,
> > > -                                     pdev->urbs[i]->transfer_dma);
> > > +                             kfree(pdev->urbs[i]->transfer_buffer);
> > >                       }
> > >                       usb_free_urb(pdev->urbs[i]);
> > >                       pdev->urbs[i] = NULL;
> 
> 
> 


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-17 20:10       ` Ezequiel Garcia
@ 2018-07-17 20:51         ` Alan Stern
  2018-07-20 10:55           ` Tomasz Figa
  2018-07-18 12:10         ` Matwey V. Kornilov
  2018-07-30 16:07         ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2018-07-17 20:51 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Matwey V. Kornilov, Hans de Goede, hverkuil, mchehab,
	Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king,
	linux-media, open list

On Tue, 17 Jul 2018, Ezequiel Garcia wrote:

> Hi Matwey,
> 
> First of all, sorry for the delay.
> 
> Adding Alan and Hans. Guys, do you have any feedback here?

...

> > > So, what is the benefit of using consistent
> > > for these URBs, as opposed to streaming?
> > 
> > I don't know, I think there is no real benefit and all we see is a
> > consequence of copy-pasta when some webcam drivers were inspired by
> > others and development priparily was going at x86 platforms.
> 
> You are probably right about the copy-pasta.
> 
> >  It would
> > be great if somebody corrected me here. DMA Coherence is quite strong
> > property and I cannot figure out how can it help when streaming video.
> > The CPU host always reads from the buffer and never writes to.
> > Hardware perepherial always writes to and never reads from. Moreover,
> > buffer access is mutually exclusive and separated in time by Interrupt
> > fireing and URB starting (when we reuse existing URB for new request).
> > Only single one memory barrier is really required here.
> > 
> 
> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> create DMA mappings and use the streaming API. Which makes more
> sense in hardware without hardware coherency.

As far as I know, the _only_ advantage to using coherent DMA in this
situation is that you then do not have to pay the overhead of
constantly setting up and tearing down the streaming mappings.  So it
depends very much on the platform: If coherent buffers are cached then
it's a slight win and otherwise it's a big lose.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-17 20:10       ` Ezequiel Garcia
  2018-07-17 20:51         ` Alan Stern
@ 2018-07-18 12:10         ` Matwey V. Kornilov
  2018-07-19 23:36           ` Ezequiel Garcia
  2018-07-30 16:07         ` Mauro Carvalho Chehab
  2 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-18 12:10 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart,
	Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> Hi Matwey,
>
> First of all, sorry for the delay.
>
> Adding Alan and Hans. Guys, do you have any feedback here?
>
> See below for some feedback on my side.
>
> On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote:
>> Hi Ezequiel,
>>
>> 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
>> > + Laurent
>> >
>> > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
>> > > DMA cocherency slows the transfer down on systems without hardware
>> > > coherent DMA.
>> > >
>> > > Based on previous commit the following performance benchmarks have been
>> > > carried out. Average memcpy() data transfer rate (rate) and handler
>> > > completion time (time) have been measured when running video stream at
>> > > 640x480 resolution at 10fps.
>> > >
>> > > x86_64 based system (Intel Core i5-3470). This platform has hardware
>> > > coherent DMA support and proposed change doesn't make big difference here.
>> > >
>> > >  * kmalloc:            rate = (4.4 +- 1.0) GBps
>> > >                        time = (2.4 +- 1.2) usec
>> > >  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
>> > >                        time = (2.5 +- 1.0) usec
>> > >
>> > > We see that the measurements agree well within error ranges in this case.
>> > > So no performance downgrade is introduced.
>> > >
>> > > armv7l based system (TI AM335x BeagleBone Black). This platform has no
>> > > hardware coherent DMA support. DMA coherence is implemented via disabled
>> > > page caching that slows down memcpy() due to memory controller behaviour.
>> > >
>> > >  * kmalloc:            rate =  (190 +-  30) MBps
>> > >                        time =   (50 +-  10) usec
>> > >  * usb_alloc_coherent: rate =   (33 +-   4) MBps
>> > >                        time = (3000 +- 400) usec
>> > >
>> > > Note, that quantative difference leads (this commit leads to 5 times
>> > > acceleration) to qualitative behavior change in this case. As it was
>> > > stated before, the video stream can not be successfully received at AM335x
>> > > platforms with MUSB based USB host controller due to performance issues
>> > > [1].
>> > >
>> > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html
>> > >
>> >
>> > This is quite interesting! I have receive similar complaints
>> > from users wanting to use stk1160 on BBB and Raspberrys,
>> > without much luck on either, due to insufficient isoc bandwidth.
>> >
>> > I'm guessing other ARM platforms could be suffering
>> > from the same issue.
>> >
>> > Note that stk1160 and uvcvideo drivers use kmalloc on platforms
>> > where DMA_NONCOHERENT is defined, but this is not the case
>> > on ARM platforms.
>>
>> There are some ARMv7 platforms that have coherent DMA (for instance
>> Broadcome Horthstar Plus series), but the most of them don't have. It
>> is defined in device tree file, and there is no way to recover this
>> information at runtime in USB perepherial driver.
>>
>> >
>> > So, what is the benefit of using consistent
>> > for these URBs, as opposed to streaming?
>>
>> I don't know, I think there is no real benefit and all we see is a
>> consequence of copy-pasta when some webcam drivers were inspired by
>> others and development priparily was going at x86 platforms.
>
> You are probably right about the copy-pasta.
>
>>  It would
>> be great if somebody corrected me here. DMA Coherence is quite strong
>> property and I cannot figure out how can it help when streaming video.
>> The CPU host always reads from the buffer and never writes to.
>> Hardware perepherial always writes to and never reads from. Moreover,
>> buffer access is mutually exclusive and separated in time by Interrupt
>> fireing and URB starting (when we reuse existing URB for new request).
>> Only single one memory barrier is really required here.
>>
>
> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> create DMA mappings and use the streaming API. Which makes more
> sense in hardware without hardware coherency.
>
> The only thing that bothers me with this patch is that it's not
> really something specific to this driver. If this fix is valid
> for pwc, then it's valid for all the drivers allocating coherent
> memory.
>
> And also, this path won't prevent further copy-paste spread
> of the coherent allocation.
>
> Is there any chance we can introduce a helper to allocate
> isoc URBs, and then change all drivers to use it? No need
> to do all of them now, but it would be good to at least have
> a plan for it.

Well, basically I am agree with you.
However, I don't have all possible hardware to test, so I can't fix
all possible drivers.
Also I can not figure out how could the helper looked like. What do
you think about usb_alloc() (c.f. usb_alloc_coherent()) ?

>
> Chances are, only a handful of platforms would care to use
> coherent mappings.
>
>> I understand that there are cases when DMA coherence is really needed,
>> for instane VirtIO VRing when we accessing same data structure in both
>> directions from the both sides, but this has nothing common with our
>> case.
>>
>> >
>> > If the choice is simply platform dependent,
>> > can't we somehow detect which mapping should
>> > be prefered?
>>
>> Now, we don't have this way.
>>
>> >
>> > > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru>
>> > > ---
>> > >  drivers/media/usb/pwc/pwc-if.c | 12 +++---------
>> > >  1 file changed, 3 insertions(+), 9 deletions(-)
>> > >
>> > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c
>> > > index 5775d1f60668..6a3cd9680a7f 100644
>> > > --- a/drivers/media/usb/pwc/pwc-if.c
>> > > +++ b/drivers/media/usb/pwc/pwc-if.c
>> > > @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev)
>> > >               urb->interval = 1; // devik
>> > >               urb->dev = udev;
>> > >               urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint);
>> > > -             urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP;
>> > > -             urb->transfer_buffer = usb_alloc_coherent(udev,
>> > > -                                                       ISO_BUFFER_SIZE,
>> > > -                                                       GFP_KERNEL,
>> > > -                                                       &urb->transfer_dma);
>> > > +             urb->transfer_flags = URB_ISO_ASAP;
>> > > +             urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL);
>> > >               if (urb->transfer_buffer == NULL) {
>> > >                       PWC_ERROR("Failed to allocate urb buffer %d\n", i);
>> > >                       pwc_isoc_cleanup(pdev);
>> > > @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev)
>> > >               if (pdev->urbs[i]) {
>> > >                       PWC_DEBUG_MEMORY("Freeing URB\n");
>> > >                       if (pdev->urbs[i]->transfer_buffer) {
>> > > -                             usb_free_coherent(pdev->udev,
>> > > -                                     pdev->urbs[i]->transfer_buffer_length,
>> > > -                                     pdev->urbs[i]->transfer_buffer,
>> > > -                                     pdev->urbs[i]->transfer_dma);
>> > > +                             kfree(pdev->urbs[i]->transfer_buffer);
>> > >                       }
>> > >                       usb_free_urb(pdev->urbs[i]);
>> > >                       pdev->urbs[i] = NULL;
>>
>>
>>
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-18 12:10         ` Matwey V. Kornilov
@ 2018-07-19 23:36           ` Ezequiel Garcia
  2018-07-20  9:35             ` Matwey V. Kornilov
  2018-07-30 15:42             ` Laurent Pinchart
  0 siblings, 2 replies; 38+ messages in thread
From: Ezequiel Garcia @ 2018-07-19 23:36 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart,
	Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote:
> 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> > Hi Matwey,
> > 
> > First of all, sorry for the delay.
> > 
> > Adding Alan and Hans. Guys, do you have any feedback here?
> > 
> > See below for some feedback on my side.
> > 
> > On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote:
> > > Hi Ezequiel,
> > > 
> > > 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> > > > + Laurent
> > > > 
> > > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
> > > > > DMA cocherency slows the transfer down on systems without hardware
> > > > > coherent DMA.
> > > > > 
> > > > > Based on previous commit the following performance benchmarks have been
> > > > > carried out. Average memcpy() data transfer rate (rate) and handler
> > > > > completion time (time) have been measured when running video stream at
> > > > > 640x480 resolution at 10fps.
> > > > > 
> > > > > x86_64 based system (Intel Core i5-3470). This platform has hardware
> > > > > coherent DMA support and proposed change doesn't make big difference here.
> > > > > 
> > > > >  * kmalloc:            rate = (4.4 +- 1.0) GBps
> > > > >                        time = (2.4 +- 1.2) usec
> > > > >  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
> > > > >                        time = (2.5 +- 1.0) usec
> > > > > 
> > > > > We see that the measurements agree well within error ranges in this case.
> > > > > So no performance downgrade is introduced.
> > > > > 
> > > > > armv7l based system (TI AM335x BeagleBone Black). This platform has no
> > > > > hardware coherent DMA support. DMA coherence is implemented via disabled
> > > > > page caching that slows down memcpy() due to memory controller behaviour.
> > > > > 
> > > > >  * kmalloc:            rate =  (190 +-  30) MBps
> > > > >                        time =   (50 +-  10) usec
> > > > >  * usb_alloc_coherent: rate =   (33 +-   4) MBps
> > > > >                        time = (3000 +- 400) usec
> > > > > 
> > > > > Note, that quantative difference leads (this commit leads to 5 times
> > > > > acceleration) to qualitative behavior change in this case. As it was
> > > > > stated before, the video stream can not be successfully received at AM335x
> > > > > platforms with MUSB based USB host controller due to performance issues
> > > > > [1].
> > > > > 
> > > > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html
> > > > > 
> > > > 
> > > > This is quite interesting! I have receive similar complaints
> > > > from users wanting to use stk1160 on BBB and Raspberrys,
> > > > without much luck on either, due to insufficient isoc bandwidth.
> > > > 
> > > > I'm guessing other ARM platforms could be suffering
> > > > from the same issue.
> > > > 
> > > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms
> > > > where DMA_NONCOHERENT is defined, but this is not the case
> > > > on ARM platforms.
> > > 
> > > There are some ARMv7 platforms that have coherent DMA (for instance
> > > Broadcome Horthstar Plus series), but the most of them don't have. It
> > > is defined in device tree file, and there is no way to recover this
> > > information at runtime in USB perepherial driver.
> > > 
> > > > 
> > > > So, what is the benefit of using consistent
> > > > for these URBs, as opposed to streaming?
> > > 
> > > I don't know, I think there is no real benefit and all we see is a
> > > consequence of copy-pasta when some webcam drivers were inspired by
> > > others and development priparily was going at x86 platforms.
> > 
> > You are probably right about the copy-pasta.
> > 
> > >  It would
> > > be great if somebody corrected me here. DMA Coherence is quite strong
> > > property and I cannot figure out how can it help when streaming video.
> > > The CPU host always reads from the buffer and never writes to.
> > > Hardware perepherial always writes to and never reads from. Moreover,
> > > buffer access is mutually exclusive and separated in time by Interrupt
> > > fireing and URB starting (when we reuse existing URB for new request).
> > > Only single one memory barrier is really required here.
> > > 
> > 
> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> > create DMA mappings and use the streaming API. Which makes more
> > sense in hardware without hardware coherency.
> > 
> > The only thing that bothers me with this patch is that it's not
> > really something specific to this driver. If this fix is valid
> > for pwc, then it's valid for all the drivers allocating coherent
> > memory.
> > 
> > And also, this path won't prevent further copy-paste spread
> > of the coherent allocation.
> > 
> > Is there any chance we can introduce a helper to allocate
> > isoc URBs, and then change all drivers to use it? No need
> > to do all of them now, but it would be good to at least have
> > a plan for it.
> 
> Well, basically I am agree with you.
> However, I don't have all possible hardware to test, so I can't fix
> all possible drivers.

Sure. And keep in mind this is more about the USB host controller,
than about this specific driver. So it's the controller what we
would have to test!

> Also I can not figure out how could the helper looked like. What do
> you think about usb_alloc() (c.f. usb_alloc_coherent()) ?
> 

I do not know that either. But it's something we can think about.

Meanwhile, it would be a shame to loose or stall this excellent
effort (which is effectively enabling a cameras on a bunch of devices).

How about you introduce a driver parameter (or device attribute),
to avoid changing the behavior for USB host controllers we don't know
about?

Something like 'alloc_coherent_urbs=y/n'. Perhaps set that
to 'yes' by default in x86, and 'no' by default in the rest?

We can think about a generic solution later.

Thanks,
Eze

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-19 23:36           ` Ezequiel Garcia
@ 2018-07-20  9:35             ` Matwey V. Kornilov
  2018-07-30 15:42             ` Laurent Pinchart
  1 sibling, 0 replies; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-20  9:35 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart,
	Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

2018-07-20 2:36 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote:
>> 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
>> > Hi Matwey,
>> >
>> > First of all, sorry for the delay.
>> >
>> > Adding Alan and Hans. Guys, do you have any feedback here?
>> >
>> > See below for some feedback on my side.
>> >
>> > On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote:
>> > > Hi Ezequiel,
>> > >
>> > > 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
>> > > > + Laurent
>> > > >
>> > > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
>> > > > > DMA cocherency slows the transfer down on systems without hardware
>> > > > > coherent DMA.
>> > > > >
>> > > > > Based on previous commit the following performance benchmarks have been
>> > > > > carried out. Average memcpy() data transfer rate (rate) and handler
>> > > > > completion time (time) have been measured when running video stream at
>> > > > > 640x480 resolution at 10fps.
>> > > > >
>> > > > > x86_64 based system (Intel Core i5-3470). This platform has hardware
>> > > > > coherent DMA support and proposed change doesn't make big difference here.
>> > > > >
>> > > > >  * kmalloc:            rate = (4.4 +- 1.0) GBps
>> > > > >                        time = (2.4 +- 1.2) usec
>> > > > >  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
>> > > > >                        time = (2.5 +- 1.0) usec
>> > > > >
>> > > > > We see that the measurements agree well within error ranges in this case.
>> > > > > So no performance downgrade is introduced.
>> > > > >
>> > > > > armv7l based system (TI AM335x BeagleBone Black). This platform has no
>> > > > > hardware coherent DMA support. DMA coherence is implemented via disabled
>> > > > > page caching that slows down memcpy() due to memory controller behaviour.
>> > > > >
>> > > > >  * kmalloc:            rate =  (190 +-  30) MBps
>> > > > >                        time =   (50 +-  10) usec
>> > > > >  * usb_alloc_coherent: rate =   (33 +-   4) MBps
>> > > > >                        time = (3000 +- 400) usec
>> > > > >
>> > > > > Note, that quantative difference leads (this commit leads to 5 times
>> > > > > acceleration) to qualitative behavior change in this case. As it was
>> > > > > stated before, the video stream can not be successfully received at AM335x
>> > > > > platforms with MUSB based USB host controller due to performance issues
>> > > > > [1].
>> > > > >
>> > > > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html
>> > > > >
>> > > >
>> > > > This is quite interesting! I have receive similar complaints
>> > > > from users wanting to use stk1160 on BBB and Raspberrys,
>> > > > without much luck on either, due to insufficient isoc bandwidth.
>> > > >
>> > > > I'm guessing other ARM platforms could be suffering
>> > > > from the same issue.
>> > > >
>> > > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms
>> > > > where DMA_NONCOHERENT is defined, but this is not the case
>> > > > on ARM platforms.
>> > >
>> > > There are some ARMv7 platforms that have coherent DMA (for instance
>> > > Broadcome Horthstar Plus series), but the most of them don't have. It
>> > > is defined in device tree file, and there is no way to recover this
>> > > information at runtime in USB perepherial driver.
>> > >
>> > > >
>> > > > So, what is the benefit of using consistent
>> > > > for these URBs, as opposed to streaming?
>> > >
>> > > I don't know, I think there is no real benefit and all we see is a
>> > > consequence of copy-pasta when some webcam drivers were inspired by
>> > > others and development priparily was going at x86 platforms.
>> >
>> > You are probably right about the copy-pasta.
>> >
>> > >  It would
>> > > be great if somebody corrected me here. DMA Coherence is quite strong
>> > > property and I cannot figure out how can it help when streaming video.
>> > > The CPU host always reads from the buffer and never writes to.
>> > > Hardware perepherial always writes to and never reads from. Moreover,
>> > > buffer access is mutually exclusive and separated in time by Interrupt
>> > > fireing and URB starting (when we reuse existing URB for new request).
>> > > Only single one memory barrier is really required here.
>> > >
>> >
>> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
>> > create DMA mappings and use the streaming API. Which makes more
>> > sense in hardware without hardware coherency.
>> >
>> > The only thing that bothers me with this patch is that it's not
>> > really something specific to this driver. If this fix is valid
>> > for pwc, then it's valid for all the drivers allocating coherent
>> > memory.
>> >
>> > And also, this path won't prevent further copy-paste spread
>> > of the coherent allocation.
>> >
>> > Is there any chance we can introduce a helper to allocate
>> > isoc URBs, and then change all drivers to use it? No need
>> > to do all of them now, but it would be good to at least have
>> > a plan for it.
>>
>> Well, basically I am agree with you.
>> However, I don't have all possible hardware to test, so I can't fix
>> all possible drivers.
>
> Sure. And keep in mind this is more about the USB host controller,
> than about this specific driver. So it's the controller what we
> would have to test!
>
>> Also I can not figure out how could the helper looked like. What do
>> you think about usb_alloc() (c.f. usb_alloc_coherent()) ?
>>
>
> I do not know that either. But it's something we can think about.
>
> Meanwhile, it would be a shame to loose or stall this excellent
> effort (which is effectively enabling a cameras on a bunch of devices).
>
> How about you introduce a driver parameter (or device attribute),
> to avoid changing the behavior for USB host controllers we don't know
> about?

It would be fine for me, if there are no objections. I know that some
maintainers don't like module parameters per se.

>
> Something like 'alloc_coherent_urbs=y/n'. Perhaps set that
> to 'yes' by default in x86, and 'no' by default in the rest?
>
> We can think about a generic solution later.
>
> Thanks,
> Eze
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-17 20:51         ` Alan Stern
@ 2018-07-20 10:55           ` Tomasz Figa
  2018-07-20 11:22             ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Tomasz Figa @ 2018-07-20 10:55 UTC (permalink / raw)
  To: stern
  Cc: Ezequiel Garcia, matwey, hdegoede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo, isely,
	bhumirks, colin.king, Linux Media Mailing List,
	Linux Kernel Mailing List

On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>
> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
>
> > Hi Matwey,
> >
> > First of all, sorry for the delay.
> >
> > Adding Alan and Hans. Guys, do you have any feedback here?
>
> ...
>
> > > > So, what is the benefit of using consistent
> > > > for these URBs, as opposed to streaming?
> > >
> > > I don't know, I think there is no real benefit and all we see is a
> > > consequence of copy-pasta when some webcam drivers were inspired by
> > > others and development priparily was going at x86 platforms.
> >
> > You are probably right about the copy-pasta.
> >
> > >  It would
> > > be great if somebody corrected me here. DMA Coherence is quite strong
> > > property and I cannot figure out how can it help when streaming video.
> > > The CPU host always reads from the buffer and never writes to.
> > > Hardware perepherial always writes to and never reads from. Moreover,
> > > buffer access is mutually exclusive and separated in time by Interrupt
> > > fireing and URB starting (when we reuse existing URB for new request).
> > > Only single one memory barrier is really required here.
> > >
> >
> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> > create DMA mappings and use the streaming API. Which makes more
> > sense in hardware without hardware coherency.
>
> As far as I know, the _only_ advantage to using coherent DMA in this
> situation is that you then do not have to pay the overhead of
> constantly setting up and tearing down the streaming mappings.  So it
> depends very much on the platform: If coherent buffers are cached then
> it's a slight win and otherwise it's a big lose.

Isn't it about usb_alloc_coherent() being backed by DMA coherent API
(dma_alloc_coherent/attrs()) and ending up allocating uncached (or
write-combine) memory for devices with non-coherent DMAs? I'm not sure
how this memory is used by this driver, but if it reads from it using
CPU, uncached mapping might be significantly slower than cached
mapping of memory allocated with kmalloc().

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-20 10:55           ` Tomasz Figa
@ 2018-07-20 11:22             ` Matwey V. Kornilov
  2018-07-20 11:33               ` Tomasz Figa
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-20 11:22 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List

2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
> On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>>
>> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
>>
>> > Hi Matwey,
>> >
>> > First of all, sorry for the delay.
>> >
>> > Adding Alan and Hans. Guys, do you have any feedback here?
>>
>> ...
>>
>> > > > So, what is the benefit of using consistent
>> > > > for these URBs, as opposed to streaming?
>> > >
>> > > I don't know, I think there is no real benefit and all we see is a
>> > > consequence of copy-pasta when some webcam drivers were inspired by
>> > > others and development priparily was going at x86 platforms.
>> >
>> > You are probably right about the copy-pasta.
>> >
>> > >  It would
>> > > be great if somebody corrected me here. DMA Coherence is quite strong
>> > > property and I cannot figure out how can it help when streaming video.
>> > > The CPU host always reads from the buffer and never writes to.
>> > > Hardware perepherial always writes to and never reads from. Moreover,
>> > > buffer access is mutually exclusive and separated in time by Interrupt
>> > > fireing and URB starting (when we reuse existing URB for new request).
>> > > Only single one memory barrier is really required here.
>> > >
>> >
>> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
>> > create DMA mappings and use the streaming API. Which makes more
>> > sense in hardware without hardware coherency.
>>
>> As far as I know, the _only_ advantage to using coherent DMA in this
>> situation is that you then do not have to pay the overhead of
>> constantly setting up and tearing down the streaming mappings.  So it
>> depends very much on the platform: If coherent buffers are cached then
>> it's a slight win and otherwise it's a big lose.
>
> Isn't it about usb_alloc_coherent() being backed by DMA coherent API
> (dma_alloc_coherent/attrs()) and ending up allocating uncached (or
> write-combine) memory for devices with non-coherent DMAs? I'm not sure

Yes, this is what exactly happens at armv7l platforms.

> how this memory is used by this driver, but if it reads from it using
> CPU, uncached mapping might be significantly slower than cached
> mapping of memory allocated with kmalloc().
>
> Best regards,
> Tomasz
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-20 11:22             ` Matwey V. Kornilov
@ 2018-07-20 11:33               ` Tomasz Figa
  2018-07-20 11:57                 ` Matwey V. Kornilov
  2018-07-30 15:13                 ` Laurent Pinchart
  0 siblings, 2 replies; 38+ messages in thread
From: Tomasz Figa @ 2018-07-20 11:33 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote:
>
> 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
> > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
> >>
> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
> >>
> >> > Hi Matwey,
> >> >
> >> > First of all, sorry for the delay.
> >> >
> >> > Adding Alan and Hans. Guys, do you have any feedback here?
> >>
> >> ...
> >>
> >> > > > So, what is the benefit of using consistent
> >> > > > for these URBs, as opposed to streaming?
> >> > >
> >> > > I don't know, I think there is no real benefit and all we see is a
> >> > > consequence of copy-pasta when some webcam drivers were inspired by
> >> > > others and development priparily was going at x86 platforms.
> >> >
> >> > You are probably right about the copy-pasta.
> >> >
> >> > >  It would
> >> > > be great if somebody corrected me here. DMA Coherence is quite strong
> >> > > property and I cannot figure out how can it help when streaming video.
> >> > > The CPU host always reads from the buffer and never writes to.
> >> > > Hardware perepherial always writes to and never reads from. Moreover,
> >> > > buffer access is mutually exclusive and separated in time by Interrupt
> >> > > fireing and URB starting (when we reuse existing URB for new request).
> >> > > Only single one memory barrier is really required here.
> >> > >
> >> >
> >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> >> > create DMA mappings and use the streaming API. Which makes more
> >> > sense in hardware without hardware coherency.
> >>
> >> As far as I know, the _only_ advantage to using coherent DMA in this
> >> situation is that you then do not have to pay the overhead of
> >> constantly setting up and tearing down the streaming mappings.  So it
> >> depends very much on the platform: If coherent buffers are cached then
> >> it's a slight win and otherwise it's a big lose.
> >
> > Isn't it about usb_alloc_coherent() being backed by DMA coherent API
> > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or
> > write-combine) memory for devices with non-coherent DMAs? I'm not sure
>
> Yes, this is what exactly happens at armv7l platforms.

Okay, thanks. So this seems to be exactly the same thing that is
happening in the UVC driver. There is quite a bit of random accesses
to extract some header fields and then a big memcpy into VB2 buffer to
assemble final frame.

If we don't want to pay the cost of creating and destroying the
streaming mapping, we could map (dma_map_single()) once, set
transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just
synchronize the caches (dma_sync_single()) before submitting/after
completing the URB.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-20 11:33               ` Tomasz Figa
@ 2018-07-20 11:57                 ` Matwey V. Kornilov
  2018-07-23 17:04                   ` Matwey V. Kornilov
  2018-07-30 15:13                 ` Laurent Pinchart
  1 sibling, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-20 11:57 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-20 14:33 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
> On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote:
>>
>> 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
>> > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>> >>
>> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
>> >>
>> >> > Hi Matwey,
>> >> >
>> >> > First of all, sorry for the delay.
>> >> >
>> >> > Adding Alan and Hans. Guys, do you have any feedback here?
>> >>
>> >> ...
>> >>
>> >> > > > So, what is the benefit of using consistent
>> >> > > > for these URBs, as opposed to streaming?
>> >> > >
>> >> > > I don't know, I think there is no real benefit and all we see is a
>> >> > > consequence of copy-pasta when some webcam drivers were inspired by
>> >> > > others and development priparily was going at x86 platforms.
>> >> >
>> >> > You are probably right about the copy-pasta.
>> >> >
>> >> > >  It would
>> >> > > be great if somebody corrected me here. DMA Coherence is quite strong
>> >> > > property and I cannot figure out how can it help when streaming video.
>> >> > > The CPU host always reads from the buffer and never writes to.
>> >> > > Hardware perepherial always writes to and never reads from. Moreover,
>> >> > > buffer access is mutually exclusive and separated in time by Interrupt
>> >> > > fireing and URB starting (when we reuse existing URB for new request).
>> >> > > Only single one memory barrier is really required here.
>> >> > >
>> >> >
>> >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
>> >> > create DMA mappings and use the streaming API. Which makes more
>> >> > sense in hardware without hardware coherency.
>> >>
>> >> As far as I know, the _only_ advantage to using coherent DMA in this
>> >> situation is that you then do not have to pay the overhead of
>> >> constantly setting up and tearing down the streaming mappings.  So it
>> >> depends very much on the platform: If coherent buffers are cached then
>> >> it's a slight win and otherwise it's a big lose.
>> >
>> > Isn't it about usb_alloc_coherent() being backed by DMA coherent API
>> > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or
>> > write-combine) memory for devices with non-coherent DMAs? I'm not sure
>>
>> Yes, this is what exactly happens at armv7l platforms.
>
> Okay, thanks. So this seems to be exactly the same thing that is
> happening in the UVC driver. There is quite a bit of random accesses
> to extract some header fields and then a big memcpy into VB2 buffer to
> assemble final frame.
>
> If we don't want to pay the cost of creating and destroying the
> streaming mapping, we could map (dma_map_single()) once, set
> transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just
> synchronize the caches (dma_sync_single()) before submitting/after
> completing the URB.

Thank you. Now it is becoming clearer to me. Please allow me some time
to try to sketch the implementation.

>
> Best regards,
> Tomasz
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-20 11:57                 ` Matwey V. Kornilov
@ 2018-07-23 17:04                   ` Matwey V. Kornilov
  2018-07-23 18:57                     ` Alan Stern
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-23 17:04 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-20 14:57 GMT+03:00 Matwey V. Kornilov <matwey@sai.msu.ru>:
> 2018-07-20 14:33 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
>> On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote:
>>>
>>> 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
>>> > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote:
>>> >>
>>> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
>>> >>
>>> >> > Hi Matwey,
>>> >> >
>>> >> > First of all, sorry for the delay.
>>> >> >
>>> >> > Adding Alan and Hans. Guys, do you have any feedback here?
>>> >>
>>> >> ...
>>> >>
>>> >> > > > So, what is the benefit of using consistent
>>> >> > > > for these URBs, as opposed to streaming?
>>> >> > >
>>> >> > > I don't know, I think there is no real benefit and all we see is a
>>> >> > > consequence of copy-pasta when some webcam drivers were inspired by
>>> >> > > others and development priparily was going at x86 platforms.
>>> >> >
>>> >> > You are probably right about the copy-pasta.
>>> >> >
>>> >> > >  It would
>>> >> > > be great if somebody corrected me here. DMA Coherence is quite strong
>>> >> > > property and I cannot figure out how can it help when streaming video.
>>> >> > > The CPU host always reads from the buffer and never writes to.
>>> >> > > Hardware perepherial always writes to and never reads from. Moreover,
>>> >> > > buffer access is mutually exclusive and separated in time by Interrupt
>>> >> > > fireing and URB starting (when we reuse existing URB for new request).
>>> >> > > Only single one memory barrier is really required here.
>>> >> > >
>>> >> >
>>> >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
>>> >> > create DMA mappings and use the streaming API. Which makes more
>>> >> > sense in hardware without hardware coherency.
>>> >>
>>> >> As far as I know, the _only_ advantage to using coherent DMA in this
>>> >> situation is that you then do not have to pay the overhead of
>>> >> constantly setting up and tearing down the streaming mappings.  So it
>>> >> depends very much on the platform: If coherent buffers are cached then
>>> >> it's a slight win and otherwise it's a big lose.
>>> >
>>> > Isn't it about usb_alloc_coherent() being backed by DMA coherent API
>>> > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or
>>> > write-combine) memory for devices with non-coherent DMAs? I'm not sure
>>>
>>> Yes, this is what exactly happens at armv7l platforms.
>>
>> Okay, thanks. So this seems to be exactly the same thing that is
>> happening in the UVC driver. There is quite a bit of random accesses
>> to extract some header fields and then a big memcpy into VB2 buffer to
>> assemble final frame.
>>
>> If we don't want to pay the cost of creating and destroying the
>> streaming mapping, we could map (dma_map_single()) once, set
>> transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just
>> synchronize the caches (dma_sync_single()) before submitting/after
>> completing the URB.
>
> Thank you. Now it is becoming clearer to me. Please allow me some time
> to try to sketch the implementation.

I've tried to strategies:

1) Use dma_unmap and dma_map inside the handler (I suppose this is
similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)

2) Use sync_cpu and sync_device inside the handler (and dma_map only
once at memory allocation)

It is interesting that dma_unmap/dma_map pair leads to the lower
overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
handler, and sync_cpu/sync_device - ~65 usec.

However, I am not sure is it mandatory to call
dma_sync_single_for_device for FROM_DEVICE direction?

>
>>
>> Best regards,
>> Tomasz
>>
>
>
>
> --
> With best regards,
> Matwey V. Kornilov.
> Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
> 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-23 17:04                   ` Matwey V. Kornilov
@ 2018-07-23 18:57                     ` Alan Stern
  2018-07-24 18:56                       ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Alan Stern @ 2018-07-23 18:57 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:

> I've tried to strategies:
> 
> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)

Yes.

> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> once at memory allocation)
> 
> It is interesting that dma_unmap/dma_map pair leads to the lower
> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> handler, and sync_cpu/sync_device - ~65 usec.
> 
> However, I am not sure is it mandatory to call
> dma_sync_single_for_device for FROM_DEVICE direction?

According to Documentation/DMA-API-HOWTO.txt, the CPU should not write 
to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
not needed.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-23 18:57                     ` Alan Stern
@ 2018-07-24 18:56                       ` Matwey V. Kornilov
  2018-07-24 20:55                         ` Alan Stern
  2018-07-30 15:35                         ` Laurent Pinchart
  0 siblings, 2 replies; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-24 18:56 UTC (permalink / raw)
  To: Alan Stern
  Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
>
>> I've tried to strategies:
>>
>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
>> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
>
> Yes.
>
>> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
>> once at memory allocation)
>>
>> It is interesting that dma_unmap/dma_map pair leads to the lower
>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
>> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
>> handler, and sync_cpu/sync_device - ~65 usec.
>>
>> However, I am not sure is it mandatory to call
>> dma_sync_single_for_device for FROM_DEVICE direction?
>
> According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> not needed.

Well, I measured the following at armv7l. The handler execution time
(URB_NO_TRANSFER_DMA_MAP is used for all cases):

1) coherent DMA: ~3000 usec (pwc is not functional)
2) explicit dma_unmap and dma_map in the handler: ~52 usec
3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec

So, I suppose that unfortunately Tomasz suggestion doesn't work. There
is no performance improvement when dma_sync_single is used.

At x86_64 the following happens:

1) coherent DMA: ~2 usec
2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec

So, whats to do next? Personally, I think that DMA streaming API
introduces not so great overhead.
Does anybody happy with turning to streaming DMA or I'll introduce
module-level switch as Ezequiel suggested?


>
> Alan Stern
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-24 18:56                       ` Matwey V. Kornilov
@ 2018-07-24 20:55                         ` Alan Stern
  2018-07-25 13:46                           ` Matwey V. Kornilov
  2018-07-30 15:35                         ` Laurent Pinchart
  1 sibling, 1 reply; 38+ messages in thread
From: Alan Stern @ 2018-07-24 20:55 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Tue, 24 Jul 2018, Matwey V. Kornilov wrote:

> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >
> >> I've tried to strategies:
> >>
> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> >
> > Yes.
> >
> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> >> once at memory allocation)
> >>
> >> It is interesting that dma_unmap/dma_map pair leads to the lower
> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> >> handler, and sync_cpu/sync_device - ~65 usec.
> >>
> >> However, I am not sure is it mandatory to call
> >> dma_sync_single_for_device for FROM_DEVICE direction?
> >
> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> > not needed.
> 
> Well, I measured the following at armv7l. The handler execution time
> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> 
> 1) coherent DMA: ~3000 usec (pwc is not functional)
> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec
> 
> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
> is no performance improvement when dma_sync_single is used.
> 
> At x86_64 the following happens:
> 
> 1) coherent DMA: ~2 usec
> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec
> 
> So, whats to do next? Personally, I think that DMA streaming API
> introduces not so great overhead.
> Does anybody happy with turning to streaming DMA or I'll introduce
> module-level switch as Ezequiel suggested?

How about using the dma_unmap and dma_map calls in the USB core?  If 
they add the same overhead as putting them in the handler, I think it 
would be acceptable for x86_64.

It certainly is odd that the dma_sync_single APIs take longer than 
simply mapping and unmapping.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-24 20:55                         ` Alan Stern
@ 2018-07-25 13:46                           ` Matwey V. Kornilov
  2018-07-30  4:14                             ` Tomasz Figa
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-07-25 13:46 UTC (permalink / raw)
  To: Alan Stern
  Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
> On Tue, 24 Jul 2018, Matwey V. Kornilov wrote:
>
>> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
>> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
>> >
>> >> I've tried to strategies:
>> >>
>> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
>> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
>> >
>> > Yes.
>> >
>> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
>> >> once at memory allocation)
>> >>
>> >> It is interesting that dma_unmap/dma_map pair leads to the lower
>> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
>> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
>> >> handler, and sync_cpu/sync_device - ~65 usec.
>> >>
>> >> However, I am not sure is it mandatory to call
>> >> dma_sync_single_for_device for FROM_DEVICE direction?
>> >
>> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
>> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
>> > not needed.
>>
>> Well, I measured the following at armv7l. The handler execution time
>> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
>>
>> 1) coherent DMA: ~3000 usec (pwc is not functional)
>> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec
>>
>> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
>> is no performance improvement when dma_sync_single is used.
>>
>> At x86_64 the following happens:
>>
>> 1) coherent DMA: ~2 usec
>> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec
>>
>> So, whats to do next? Personally, I think that DMA streaming API
>> introduces not so great overhead.
>> Does anybody happy with turning to streaming DMA or I'll introduce
>> module-level switch as Ezequiel suggested?
>
> How about using the dma_unmap and dma_map calls in the USB core?  If
> they add the same overhead as putting them in the handler, I think it
> would be acceptable for x86_64.

Sure, that is the simplest way to implement streaming API.

>
> It certainly is odd that the dma_sync_single APIs take longer than
> simply mapping and unmapping.

Well. I am surprised too. Probably, it is related to that only 9560
bytes are used for each URB. It is only three memory pages.

>
> Alan Stern
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-25 13:46                           ` Matwey V. Kornilov
@ 2018-07-30  4:14                             ` Tomasz Figa
  2018-08-04  8:05                               ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Tomasz Figa @ 2018-07-30  4:14 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Wed, Jul 25, 2018 at 10:47 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote:
>
> 2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
> > On Tue, 24 Jul 2018, Matwey V. Kornilov wrote:
> >
> >> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
> >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >> >
> >> >> I've tried to strategies:
> >> >>
> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> >> >
> >> > Yes.
> >> >
> >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> >> >> once at memory allocation)
> >> >>
> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower
> >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> >> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> >> >> handler, and sync_cpu/sync_device - ~65 usec.
> >> >>
> >> >> However, I am not sure is it mandatory to call
> >> >> dma_sync_single_for_device for FROM_DEVICE direction?
> >> >
> >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> >> > not needed.
> >>
> >> Well, I measured the following at armv7l. The handler execution time
> >> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> >>
> >> 1) coherent DMA: ~3000 usec (pwc is not functional)
> >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec

That's very strange because on ARM dma_unmap_single() does exactly the
same thing as dma_sync_single_for_cpu():

arm_dma_map_page()
https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L129
arm_dma_unmap_page()
https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L159

arm_dma_sync_single_for_cpu()
https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L167

Could you post the code you used for testing of cases 2) and 3)?

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-20 11:33               ` Tomasz Figa
  2018-07-20 11:57                 ` Matwey V. Kornilov
@ 2018-07-30 15:13                 ` Laurent Pinchart
  1 sibling, 0 replies; 38+ messages in thread
From: Laurent Pinchart @ 2018-07-30 15:13 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede,
	Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely,
	Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

Hi Tomasz,

On Friday, 20 July 2018 14:33:33 EEST Tomasz Figa wrote:
> On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov wrote:
> > 2018-07-20 13:55 GMT+03:00 Tomasz Figa:
> >> On Wed, Jul 18, 2018 at 5:51 AM Alan Stern wrote:
> >>> On Tue, 17 Jul 2018, Ezequiel Garcia wrote:
> >>>> Hi Matwey,
> >>>> 
> >>>> First of all, sorry for the delay.
> >>>> 
> >>>> Adding Alan and Hans. Guys, do you have any feedback here?
> >>> 
> >>> ...
> >>> 
> >>>>>> So, what is the benefit of using consistent
> >>>>>> for these URBs, as opposed to streaming?
> >>>>> 
> >>>>> I don't know, I think there is no real benefit and all we see is a
> >>>>> consequence of copy-pasta when some webcam drivers were inspired by
> >>>>> others and development priparily was going at x86 platforms.
> >>>> 
> >>>> You are probably right about the copy-pasta.
> >>>> 
> >>>>> It would be great if somebody corrected me here. DMA Coherence is
> >>>>> quite strong property and I cannot figure out how can it help when
> >>>>> streaming video. The CPU host always reads from the buffer and never
> >>>>> writes to. Hardware perepherial always writes to and never reads from.
> >>>>> Moreover, buffer access is mutually exclusive and separated in time by
> >>>>> Interrupt fireing and URB starting (when we reuse existing URB for new
> >>>>> request). Only single one memory barrier is really required here.
> >>>> 
> >>>> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> >>>> create DMA mappings and use the streaming API. Which makes more
> >>>> sense in hardware without hardware coherency.
> >>> 
> >>> As far as I know, the _only_ advantage to using coherent DMA in this
> >>> situation is that you then do not have to pay the overhead of
> >>> constantly setting up and tearing down the streaming mappings.  So it
> >>> depends very much on the platform: If coherent buffers are cached then
> >>> it's a slight win and otherwise it's a big lose.
> >> 
> >> Isn't it about usb_alloc_coherent() being backed by DMA coherent API
> >> (dma_alloc_coherent/attrs()) and ending up allocating uncached (or
> >> write-combine) memory for devices with non-coherent DMAs? I'm not sure
> > 
> > Yes, this is what exactly happens at armv7l platforms.
> 
> Okay, thanks. So this seems to be exactly the same thing that is
> happening in the UVC driver. There is quite a bit of random accesses
> to extract some header fields and then a big memcpy into VB2 buffer to
> assemble final frame.
> 
> If we don't want to pay the cost of creating and destroying the
> streaming mapping, we could map (dma_map_single()) once, set
> transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just
> synchronize the caches (dma_sync_single()) before submitting/after
> completing the URB.

The problem is that the dma_sync_single() calls can end up being quite costly, 
depending on the platform, its cache architecture, and the buffer size. For a 
given system and use case we should always be able to decide which option is 
best, but finding that dynamically at runtime isn't an easy task. I remember 
that when developing the OMAP3 ISP driver at Nokia we had a heuristics that 
forced a full D-cache clean above a certain picture resolution as that was 
faster than selectively cleaning cache lines.

Furthermore, the DMA mapping API doesn't help us here, as it doesn't allow a 
platform to optimize operations based on the buffer mappings. An easy example 
is that there's no way for the DMA mapping implementation on ARM to find out 
in the dma_sync_single() operation that the buffer isn't mapped to the CPU and 
that the CPU cache clean can be skipped. This doesn't affect USB drivers as a 
CPU mapping always exists, but there could be some limitations there too when 
we'll try to optimize the implementation.

-- 
Regards,

Laurent Pinchart




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-24 18:56                       ` Matwey V. Kornilov
  2018-07-24 20:55                         ` Alan Stern
@ 2018-07-30 15:35                         ` Laurent Pinchart
  2018-08-04  8:00                           ` Matwey V. Kornilov
  1 sibling, 1 reply; 38+ messages in thread
From: Laurent Pinchart @ 2018-07-30 15:35 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede,
	Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

Hi Matwey,

On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
> 2018-07-23 21:57 GMT+03:00 Alan Stern:
> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >> I've tried to strategies:
> >> 
> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> > 
> > Yes.
> > 
> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> >> once at memory allocation)
> >> 
> >> It is interesting that dma_unmap/dma_map pair leads to the lower
> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> >> handler, and sync_cpu/sync_device - ~65 usec.
> >> 
> >> However, I am not sure is it mandatory to call
> >> dma_sync_single_for_device for FROM_DEVICE direction?
> > 
> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> > not needed.
> 
> Well, I measured the following at armv7l. The handler execution time
> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> 
> 1) coherent DMA: ~3000 usec (pwc is not functional)
> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56
> usec

I really don't understand why the sync option is slower. Could you please 
investigate ? Before doing anything we need to make sure we have a full 
understanding of the problem.

> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
> is no performance improvement when dma_sync_single is used.
> 
> At x86_64 the following happens:
> 
> 1) coherent DMA: ~2 usec

What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() ? 
Could you trace it to see how memory is allocated exactly, and how it's mapped 
to the CPU ? I suspect that it will end up in dma_direct_alloc() but I'd like 
a confirmation.

> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec
> 
> So, whats to do next? Personally, I think that DMA streaming API
> introduces not so great overhead.

It might not be very large, but with USB3 cameras at high resolutions and 
framerates, it might still become noticeable. I wouldn't degrade performances 
on x86, especially if we can decide which option to use based on the platform 
(or perhaps even better based on Kconfig options such as DMA_NONCOHERENT).

> Does anybody happy with turning to streaming DMA or I'll introduce
> module-level switch as Ezequiel suggested?

A module-level switch isn't a good idea, it will just confuse users. We need 
to establish a strategy and come up with a good heuristic that can be applied 
at compile and/or runtime to automatically decide how to allocate buffers.

-- 
Regards,

Laurent Pinchart




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-19 23:36           ` Ezequiel Garcia
  2018-07-20  9:35             ` Matwey V. Kornilov
@ 2018-07-30 15:42             ` Laurent Pinchart
  1 sibling, 0 replies; 38+ messages in thread
From: Laurent Pinchart @ 2018-07-30 15:42 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Matwey V. Kornilov, Alan Stern, Hans de Goede, hverkuil, mchehab,
	Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King,
	linux-media, open list

Hi Ezequiel,

On Friday, 20 July 2018 02:36:40 EEST Ezequiel Garcia wrote:
> On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote:
> > 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> >> On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote:
> >>> 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>:
> >>>> On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote:
> >>>>> DMA cocherency slows the transfer down on systems without hardware
> >>>>> coherent DMA.
> >>>>> 
> >>>>> Based on previous commit the following performance benchmarks have
> >>>>> been carried out. Average memcpy() data transfer rate (rate) and
> >>>>> handler completion time (time) have been measured when running
> >>>>> video stream at 640x480 resolution at 10fps.
> >>>>> 
> >>>>> x86_64 based system (Intel Core i5-3470). This platform has
> >>>>> hardware coherent DMA support and proposed change doesn't make big
> >>>>> difference here.
> >>>>> 
> >>>>>  * kmalloc:            rate = (4.4 +- 1.0) GBps
> >>>>>                        time = (2.4 +- 1.2) usec
> >>>>>  
> >>>>>  * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps
> >>>>>                        time = (2.5 +- 1.0) usec
> >>>>> 
> >>>>> We see that the measurements agree well within error ranges in
> >>>>> this case. So no performance downgrade is introduced.
> >>>>> 
> >>>>> armv7l based system (TI AM335x BeagleBone Black). This platform
> >>>>> has no hardware coherent DMA support. DMA coherence is implemented
> >>>>> via disabled page caching that slows down memcpy() due to memory
> >>>>> controller behaviour.
> >>>>> 
> >>>>>  * kmalloc:            rate =  (190 +-  30) MBps
> >>>>>                        time =   (50 +-  10) usec
> >>>>>  
> >>>>>  * usb_alloc_coherent: rate =   (33 +-   4) MBps
> >>>>>                        time = (3000 +- 400) usec
> >>>>> 
> >>>>> Note, that quantative difference leads (this commit leads to 5
> >>>>> times acceleration) to qualitative behavior change in this case.
> >>>>> As it was stated before, the video stream can not be successfully
> >>>>> received at AM335x platforms with MUSB based USB host controller
> >>>>> due to performance issues [1].
> >>>>> 
> >>>>> [1] https://www.spinics.net/lists/linux-usb/msg165735.html
> >>>> 
> >>>> This is quite interesting! I have receive similar complaints
> >>>> from users wanting to use stk1160 on BBB and Raspberrys,
> >>>> without much luck on either, due to insufficient isoc bandwidth.
> >>>> 
> >>>> I'm guessing other ARM platforms could be suffering
> >>>> from the same issue.
> >>>> 
> >>>> Note that stk1160 and uvcvideo drivers use kmalloc on platforms
> >>>> where DMA_NONCOHERENT is defined, but this is not the case
> >>>> on ARM platforms.
> >>> 
> >>> There are some ARMv7 platforms that have coherent DMA (for instance
> >>> Broadcome Horthstar Plus series), but the most of them don't have. It
> >>> is defined in device tree file, and there is no way to recover this
> >>> information at runtime in USB perepherial driver.
> >>> 
> >>>> So, what is the benefit of using consistent
> >>>> for these URBs, as opposed to streaming?
> >>> 
> >>> I don't know, I think there is no real benefit and all we see is a
> >>> consequence of copy-pasta when some webcam drivers were inspired by
> >>> others and development priparily was going at x86 platforms.
> >> 
> >> You are probably right about the copy-pasta.
> >> 
> >>> It would be great if somebody corrected me here. DMA Coherence is quite
> >>> strong property and I cannot figure out how can it help when streaming
> >>> video. The CPU host always reads from the buffer and never writes to.
> >>> Hardware perepherial always writes to and never reads from. Moreover,
> >>> buffer access is mutually exclusive and separated in time by Interrupt
> >>> fireing and URB starting (when we reuse existing URB for new request).
> >>> Only single one memory barrier is really required here.
> >> 
> >> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> >> create DMA mappings and use the streaming API. Which makes more
> >> sense in hardware without hardware coherency.
> >> 
> >> The only thing that bothers me with this patch is that it's not
> >> really something specific to this driver. If this fix is valid
> >> for pwc, then it's valid for all the drivers allocating coherent
> >> memory.
> >> 
> >> And also, this path won't prevent further copy-paste spread
> >> of the coherent allocation.
> >> 
> >> Is there any chance we can introduce a helper to allocate
> >> isoc URBs, and then change all drivers to use it? No need
> >> to do all of them now, but it would be good to at least have
> >> a plan for it.
> > 
> > Well, basically I am agree with you.
> > However, I don't have all possible hardware to test, so I can't fix
> > all possible drivers.
> 
> Sure. And keep in mind this is more about the USB host controller,
> than about this specific driver. So it's the controller what we
> would have to test!
> 
> > Also I can not figure out how could the helper looked like. What do
> > you think about usb_alloc() (c.f. usb_alloc_coherent()) ?
> 
> I do not know that either. But it's something we can think about.
> 
> Meanwhile, it would be a shame to loose or stall this excellent
> effort (which is effectively enabling a cameras on a bunch of devices).
> 
> How about you introduce a driver parameter (or device attribute),
> to avoid changing the behavior for USB host controllers we don't know
> about?
> 
> Something like 'alloc_coherent_urbs=y/n'. Perhaps set that
> to 'yes' by default in x86, and 'no' by default in the rest?
> 
> We can think about a generic solution later.

A generic solution would be much better though. Could we still try to achieve 
one, and only go for a hack as a last resort ? With an analysis of code flows 
and performances on x86 vs. ARM I don't think it would be too difficult to 
decide what to do.

-- 
Regards,

Laurent Pinchart




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-17 20:10       ` Ezequiel Garcia
  2018-07-17 20:51         ` Alan Stern
  2018-07-18 12:10         ` Matwey V. Kornilov
@ 2018-07-30 16:07         ` Mauro Carvalho Chehab
  2018-07-31  6:06           ` Tomasz Figa
  2 siblings, 1 reply; 38+ messages in thread
From: Mauro Carvalho Chehab @ 2018-07-30 16:07 UTC (permalink / raw)
  To: Ezequiel Garcia
  Cc: Matwey V. Kornilov, Alan Stern, Hans de Goede, hverkuil, mchehab,
	Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king,
	linux-media, open list

Em Tue, 17 Jul 2018 17:10:22 -0300
Ezequiel Garcia <ezequiel@collabora.com> escreveu:

> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> create DMA mappings and use the streaming API. Which makes more
> sense in hardware without hardware coherency.
>
> The only thing that bothers me with this patch is that it's not
> really something specific to this driver. If this fix is valid
> for pwc, then it's valid for all the drivers allocating coherent
> memory.

We're actually doing this change on other drivers:
	https://git.linuxtv.org/media_tree.git/commit/?id=d571b592c6206

I suspect that the reason why all USB media drivers were using
URB_NO_TRANSFER_DMA_MAP is just because the first media USB driver
upstream used it.

On that time, I remember I tried once to not use this flag, but there
was something that broke (perhaps I just didn't know enough about the
USB layer - or perhaps some fixes happened at USB core - allowing it
to be used with ISOC transfers).

Anyway, nowadays, I fail to see a reason why not let the USB core
do the DMA maps. On my tests after this patch, at the boards I tested
(arm and x86), I was unable to see any regressions.

Thanks,
Mauro

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-30 16:07         ` Mauro Carvalho Chehab
@ 2018-07-31  6:06           ` Tomasz Figa
  0 siblings, 0 replies; 38+ messages in thread
From: Tomasz Figa @ 2018-07-31  6:06 UTC (permalink / raw)
  To: Mauro Carvalho Chehab
  Cc: Ezequiel Garcia, Matwey V. Kornilov, Alan Stern, hdegoede,
	Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, rostedt,
	mingo, Mike Isely, Bhumika Goyal, Colin King,
	Linux Media Mailing List, Linux Kernel Mailing List

On Tue, Jul 31, 2018 at 1:07 AM Mauro Carvalho Chehab
<mchehab+samsung@kernel.org> wrote:
>
> Em Tue, 17 Jul 2018 17:10:22 -0300
> Ezequiel Garcia <ezequiel@collabora.com> escreveu:
>
> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core
> > create DMA mappings and use the streaming API. Which makes more
> > sense in hardware without hardware coherency.
> >
> > The only thing that bothers me with this patch is that it's not
> > really something specific to this driver. If this fix is valid
> > for pwc, then it's valid for all the drivers allocating coherent
> > memory.
>
> We're actually doing this change on other drivers:
>         https://git.linuxtv.org/media_tree.git/commit/?id=d571b592c6206
>
> I suspect that the reason why all USB media drivers were using
> URB_NO_TRANSFER_DMA_MAP is just because the first media USB driver
> upstream used it.
>
> On that time, I remember I tried once to not use this flag, but there
> was something that broke (perhaps I just didn't know enough about the
> USB layer - or perhaps some fixes happened at USB core - allowing it
> to be used with ISOC transfers).
>
> Anyway, nowadays, I fail to see a reason why not let the USB core
> do the DMA maps. On my tests after this patch, at the boards I tested
> (arm and x86), I was unable to see any regressions.

I can see one reason:

usb_alloc_coherent() would use dma_pool_alloc() with a fallback to
dma_alloc_coherent() to do the allocation. I'm not sure what a typical
size for an URB buffer is, but I assume that it's definitely more than
1 page. Order >0 allocations with page allocator (and SLAB, which
eventually just falls back to page allocator for such large
allocations) are generally considered costly. With allocation through
DMA API, mechanisms such as CMA or IOMMU can be used (if available),
making it much more likely to have the allocation satisfied on heavy
load / long uptime systems.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-30 15:35                         ` Laurent Pinchart
@ 2018-08-04  8:00                           ` Matwey V. Kornilov
  2018-08-04 14:46                             ` Alan Stern
  2018-08-08 22:32                             ` Laurent Pinchart
  0 siblings, 2 replies; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-08-04  8:00 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede,
	Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-30 18:35 GMT+03:00 Laurent Pinchart <laurent.pinchart@ideasonboard.com>:
> Hi Matwey,
>
> On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
>> 2018-07-23 21:57 GMT+03:00 Alan Stern:
>> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
>> >> I've tried to strategies:
>> >>
>> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
>> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
>> >
>> > Yes.
>> >
>> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
>> >> once at memory allocation)
>> >>
>> >> It is interesting that dma_unmap/dma_map pair leads to the lower
>> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
>> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
>> >> handler, and sync_cpu/sync_device - ~65 usec.
>> >>
>> >> However, I am not sure is it mandatory to call
>> >> dma_sync_single_for_device for FROM_DEVICE direction?
>> >
>> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
>> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
>> > not needed.
>>
>> Well, I measured the following at armv7l. The handler execution time
>> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
>>
>> 1) coherent DMA: ~3000 usec (pwc is not functional)
>> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56
>> usec
>
> I really don't understand why the sync option is slower. Could you please
> investigate ? Before doing anything we need to make sure we have a full
> understanding of the problem.

Hi,

I've found one drawback in my measurements. I forgot to fix CPU
frequency at lowest state 300MHz. Now, I remeasured

2) dma_unmap and dma_map in the handler:
2A) dma_unmap_single call: 28.8 +- 1.5 usec
2B) memcpy and the rest: 58 +- 6 usec
2C) dma_map_single call: 22 +- 2 usec
Total: 110 +- 7 usec

3) dma_sync_single_for_cpu
3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
3B) memcpy and the rest: 59 +- 6 usec
3C) noop (trace events overhead): 5 +- 2 usec
Total: 93 +- 7 usec

So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
error ranges.

>
>> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
>> is no performance improvement when dma_sync_single is used.
>>
>> At x86_64 the following happens:
>>
>> 1) coherent DMA: ~2 usec
>
> What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() ?
> Could you trace it to see how memory is allocated exactly, and how it's mapped
> to the CPU ? I suspect that it will end up in dma_direct_alloc() but I'd like
> a confirmation.

usb_alloc_coherents() ends up inside hcd_buffer_alloc() where
dma_alloc_coherent() is called. Keep in mind, that requested size is
9560 in our case and pool is not used.

>
>> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec
>>
>> So, whats to do next? Personally, I think that DMA streaming API
>> introduces not so great overhead.
>
> It might not be very large, but with USB3 cameras at high resolutions and
> framerates, it might still become noticeable. I wouldn't degrade performances
> on x86, especially if we can decide which option to use based on the platform
> (or perhaps even better based on Kconfig options such as DMA_NONCOHERENT).

PWC is discontinued chip, so there will not be any new USB3 cameras.

Kconfig won't work here, as I said before, DMA config is stored inside
device tree blob on ARM architecture.

>
>> Does anybody happy with turning to streaming DMA or I'll introduce
>> module-level switch as Ezequiel suggested?
>
> A module-level switch isn't a good idea, it will just confuse users. We need
> to establish a strategy and come up with a good heuristic that can be applied
> at compile and/or runtime to automatically decide how to allocate buffers.

I am agree in general, but I cannot understand why webcam driver
should think about memory allocation heuristics.

>
> --
> Regards,
>
> Laurent Pinchart
>
>
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-07-30  4:14                             ` Tomasz Figa
@ 2018-08-04  8:05                               ` Matwey V. Kornilov
  0 siblings, 0 replies; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-08-04  8:05 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

2018-07-30 7:14 GMT+03:00 Tomasz Figa <tfiga@chromium.org>:
> On Wed, Jul 25, 2018 at 10:47 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote:
>>
>> 2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
>> > On Tue, 24 Jul 2018, Matwey V. Kornilov wrote:
>> >
>> >> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>:
>> >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
>> >> >
>> >> >> I've tried to strategies:
>> >> >>
>> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
>> >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
>> >> >
>> >> > Yes.
>> >> >
>> >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
>> >> >> once at memory allocation)
>> >> >>
>> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower
>> >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
>> >> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
>> >> >> handler, and sync_cpu/sync_device - ~65 usec.
>> >> >>
>> >> >> However, I am not sure is it mandatory to call
>> >> >> dma_sync_single_for_device for FROM_DEVICE direction?
>> >> >
>> >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
>> >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
>> >> > not needed.
>> >>
>> >> Well, I measured the following at armv7l. The handler execution time
>> >> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
>> >>
>> >> 1) coherent DMA: ~3000 usec (pwc is not functional)
>> >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
>> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec
>
> That's very strange because on ARM dma_unmap_single() does exactly the
> same thing as dma_sync_single_for_cpu():
>
> arm_dma_map_page()
> https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L129
> arm_dma_unmap_page()
> https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L159
>
> arm_dma_sync_single_for_cpu()
> https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L167
>
> Could you post the code you used for testing of cases 2) and 3)?

I remeasured the timings (see my other mail).
The code you asked is here:
https://github.com/matwey/linux/tree/pwc_dma_v3 but I am going to
rebase this branch.

>
> Best regards,
> Tomasz
>



-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-04  8:00                           ` Matwey V. Kornilov
@ 2018-08-04 14:46                             ` Alan Stern
  2018-08-05  7:49                               ` Christoph Hellwig
  2018-08-08 22:32                             ` Laurent Pinchart
  1 sibling, 1 reply; 38+ messages in thread
From: Alan Stern @ 2018-08-04 14:46 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Laurent Pinchart, Tomasz Figa, Ezequiel Garcia, Hans de Goede,
	Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Sat, 4 Aug 2018, Matwey V. Kornilov wrote:

> 2018-07-30 18:35 GMT+03:00 Laurent Pinchart <laurent.pinchart@ideasonboard.com>:
> > Hi Matwey,
> >
> > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
> >> 2018-07-23 21:57 GMT+03:00 Alan Stern:
> >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >> >> I've tried to strategies:
> >> >>
> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> >> >
> >> > Yes.
> >> >
> >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> >> >> once at memory allocation)
> >> >>
> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower
> >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> >> >> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> >> >> handler, and sync_cpu/sync_device - ~65 usec.
> >> >>
> >> >> However, I am not sure is it mandatory to call
> >> >> dma_sync_single_for_device for FROM_DEVICE direction?
> >> >
> >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> >> > not needed.
> >>
> >> Well, I measured the following at armv7l. The handler execution time
> >> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> >>
> >> 1) coherent DMA: ~3000 usec (pwc is not functional)
> >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56
> >> usec
> >
> > I really don't understand why the sync option is slower. Could you please
> > investigate ? Before doing anything we need to make sure we have a full
> > understanding of the problem.
> 
> Hi,
> 
> I've found one drawback in my measurements. I forgot to fix CPU
> frequency at lowest state 300MHz. Now, I remeasured
> 
> 2) dma_unmap and dma_map in the handler:
> 2A) dma_unmap_single call: 28.8 +- 1.5 usec
> 2B) memcpy and the rest: 58 +- 6 usec
> 2C) dma_map_single call: 22 +- 2 usec
> Total: 110 +- 7 usec
> 
> 3) dma_sync_single_for_cpu
> 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
> 3B) memcpy and the rest: 59 +- 6 usec
> 3C) noop (trace events overhead): 5 +- 2 usec
> Total: 93 +- 7 usec
> 
> So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
> error ranges.

Taken together, those measurements look like a pretty good argument for 
always using dma_sync_single_for_cpu in the driver.  Provided results 
on other platforms aren't too far out of line with these results.

Alan Stern


^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-04 14:46                             ` Alan Stern
@ 2018-08-05  7:49                               ` Christoph Hellwig
  2018-08-05  8:33                                 ` Matwey V. Kornilov
  0 siblings, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2018-08-05  7:49 UTC (permalink / raw)
  To: Alan Stern
  Cc: Matwey V. Kornilov, Laurent Pinchart, Tomasz Figa,
	Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely,
	Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Sat, Aug 04, 2018 at 10:46:35AM -0400, Alan Stern wrote:
> > 2) dma_unmap and dma_map in the handler:
> > 2A) dma_unmap_single call: 28.8 +- 1.5 usec
> > 2B) memcpy and the rest: 58 +- 6 usec
> > 2C) dma_map_single call: 22 +- 2 usec
> > Total: 110 +- 7 usec
> > 
> > 3) dma_sync_single_for_cpu
> > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
> > 3B) memcpy and the rest: 59 +- 6 usec
> > 3C) noop (trace events overhead): 5 +- 2 usec
> > Total: 93 +- 7 usec
> > 
> > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
> > error ranges.
> 
> Taken together, those measurements look like a pretty good argument for 
> always using dma_sync_single_for_cpu in the driver.  Provided results 
> on other platforms aren't too far out of line with these results.

Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu
and dma_unmap should always be identical.  With the migration towards
everyone using dma-direct and dma-noncoherent this is actually going to
be enforced, and I plan to move that enforcement to common code in the
next merge window or two.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-05  7:49                               ` Christoph Hellwig
@ 2018-08-05  8:33                                 ` Matwey V. Kornilov
  2018-08-05  8:41                                   ` Christoph Hellwig
  0 siblings, 1 reply; 38+ messages in thread
From: Matwey V. Kornilov @ 2018-08-05  8:33 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Alan Stern, Laurent Pinchart, Tomasz Figa, Ezequiel Garcia,
	Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab,
	Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King,
	Linux Media Mailing List, Linux Kernel Mailing List,
	Kieran Bingham, keiichiw

2018-08-05 10:49 GMT+03:00 Christoph Hellwig <hch@infradead.org>:
> On Sat, Aug 04, 2018 at 10:46:35AM -0400, Alan Stern wrote:
>> > 2) dma_unmap and dma_map in the handler:
>> > 2A) dma_unmap_single call: 28.8 +- 1.5 usec
>> > 2B) memcpy and the rest: 58 +- 6 usec
>> > 2C) dma_map_single call: 22 +- 2 usec
>> > Total: 110 +- 7 usec
>> >
>> > 3) dma_sync_single_for_cpu
>> > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
>> > 3B) memcpy and the rest: 59 +- 6 usec
>> > 3C) noop (trace events overhead): 5 +- 2 usec
>> > Total: 93 +- 7 usec
>> >
>> > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
>> > error ranges.
>>
>> Taken together, those measurements look like a pretty good argument for
>> always using dma_sync_single_for_cpu in the driver.  Provided results
>> on other platforms aren't too far out of line with these results.
>
> Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu
> and dma_unmap should always be identical.  With the migration towards
> everyone using dma-direct and dma-noncoherent this is actually going to
> be enforced, and I plan to move that enforcement to common code in the
> next merge window or two.
>

I think that Alan means that using dma_sync_single_for_cpu() we save
time required for subsequent dma_map() call (which is required when we
do dma_unmap()).

So, does everybody happy with dma_sync_single_for_cpu() for now? If
so, then I'll prepare new version of the patch series.

-- 
With best regards,
Matwey V. Kornilov.
Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia
119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-05  8:33                                 ` Matwey V. Kornilov
@ 2018-08-05  8:41                                   ` Christoph Hellwig
  0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2018-08-05  8:41 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Christoph Hellwig, Alan Stern, Laurent Pinchart, Tomasz Figa,
	Ezequiel Garcia, Hans de Goede, Hans Verkuil,
	Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely,
	Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Sun, Aug 05, 2018 at 11:33:38AM +0300, Matwey V. Kornilov wrote:
> >> Taken together, those measurements look like a pretty good argument for
> >> always using dma_sync_single_for_cpu in the driver.  Provided results
> >> on other platforms aren't too far out of line with these results.
> >
> > Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu
> > and dma_unmap should always be identical.  With the migration towards
> > everyone using dma-direct and dma-noncoherent this is actually going to
> > be enforced, and I plan to move that enforcement to common code in the
> > next merge window or two.
> >
> 
> I think that Alan means that using dma_sync_single_for_cpu() we save
> time required for subsequent dma_map() call (which is required when we
> do dma_unmap()).

The point still stands.  By definition for a correct DMA API
implementation a dma_sync_single_for_cpu/dma_sync_single_for_device
pair is always going to be cheaper than a dma_unmap/dma_map pair,
although for many cases the difference might be so small that it is
not measureable.

If you reuse a buffer using dma_sync_single* is always the right thing
to do vs unmapping and remapping it.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-04  8:00                           ` Matwey V. Kornilov
  2018-08-04 14:46                             ` Alan Stern
@ 2018-08-08 22:32                             ` Laurent Pinchart
  2018-08-09  2:36                               ` Tomasz Figa
  1 sibling, 1 reply; 38+ messages in thread
From: Laurent Pinchart @ 2018-08-08 22:32 UTC (permalink / raw)
  To: Matwey V. Kornilov
  Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede,
	Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo,
	Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

Hi Matwey,

On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote:
> 2018-07-30 18:35 GMT+03:00 Laurent Pinchart:
> > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
> >> 2018-07-23 21:57 GMT+03:00 Alan Stern:
> >>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >>>> I've tried to strategies:
> >>>> 
> >>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >>>> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> >>> 
> >>> Yes.
> >>> 
> >>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> >>>> once at memory allocation)
> >>>> 
> >>>> It is interesting that dma_unmap/dma_map pair leads to the lower
> >>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> >>>> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> >>>> handler, and sync_cpu/sync_device - ~65 usec.
> >>>> 
> >>>> However, I am not sure is it mandatory to call
> >>>> dma_sync_single_for_device for FROM_DEVICE direction?
> >>> 
> >>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> >>> to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> >>> not needed.
> >> 
> >> Well, I measured the following at armv7l. The handler execution time
> >> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> >> 
> >> 1) coherent DMA: ~3000 usec (pwc is not functional)
> >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56
> >> usec
> > 
> > I really don't understand why the sync option is slower. Could you please
> > investigate ? Before doing anything we need to make sure we have a full
> > understanding of the problem.
> 
> Hi,
> 
> I've found one drawback in my measurements. I forgot to fix CPU
> frequency at lowest state 300MHz. Now, I remeasured
> 
> 2) dma_unmap and dma_map in the handler:
> 2A) dma_unmap_single call: 28.8 +- 1.5 usec
> 2B) memcpy and the rest: 58 +- 6 usec
> 2C) dma_map_single call: 22 +- 2 usec
> Total: 110 +- 7 usec
> 
> 3) dma_sync_single_for_cpu
> 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
> 3B) memcpy and the rest: 59 +- 6 usec
> 3C) noop (trace events overhead): 5 +- 2 usec
> Total: 93 +- 7 usec
> 
> So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
> error ranges.

Thank you for the time you've spent on these measurements, the information is 
useful and your work very appreciated.

> >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
> >> is no performance improvement when dma_sync_single is used.
> >> 
> >> At x86_64 the following happens:
> >> 
> >> 1) coherent DMA: ~2 usec
> > 
> > What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent()
> > ? Could you trace it to see how memory is allocated exactly, and how it's
> > mapped to the CPU ? I suspect that it will end up in dma_direct_alloc()
> > but I'd like a confirmation.
> 
> usb_alloc_coherents() ends up inside hcd_buffer_alloc() where
> dma_alloc_coherent() is called. Keep in mind, that requested size is
> 9560 in our case and pool is not used.
> 
> >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4
> >> usec
> >> 
> >> So, whats to do next? Personally, I think that DMA streaming API
> >> introduces not so great overhead.
> > 
> > It might not be very large, but with USB3 cameras at high resolutions and
> > framerates, it might still become noticeable. I wouldn't degrade
> > performances on x86, especially if we can decide which option to use
> > based on the platform (or perhaps even better based on Kconfig options
> > such as DMA_NONCOHERENT).
> 
> PWC is discontinued chip, so there will not be any new USB3 cameras.

You're right. I had in mind other USB cameras that would benefit from the same 
change, and in particular the uvcvideo driver, which is used by USB3 cameras.

> Kconfig won't work here, as I said before, DMA config is stored inside
> device tree blob on ARM architecture.

But couldn't we skip it at least on x86 ?

> >> Does anybody happy with turning to streaming DMA or I'll introduce
> >> module-level switch as Ezequiel suggested?
> > 
> > A module-level switch isn't a good idea, it will just confuse users. We
> > need to establish a strategy and come up with a good heuristic that can
> > be applied at compile and/or runtime to automatically decide how to
> > allocate buffers.
> 
> I am agree in general, but I cannot understand why webcam driver
> should think about memory allocation heuristics.

I fully agree with you, this should be handled by either the USB core or the 
media core (possibly with a few static hints from the driver, such as buffer 
sizes, to help with heuristics, if needed at all).

-- 
Regards,

Laurent Pinchart




^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-08 22:32                             ` Laurent Pinchart
@ 2018-08-09  2:36                               ` Tomasz Figa
  2018-08-09 10:28                                 ` Laurent Pinchart
  0 siblings, 1 reply; 38+ messages in thread
From: Tomasz Figa @ 2018-08-09  2:36 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede,
	Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely,
	Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

On Thu, Aug 9, 2018 at 7:31 AM Laurent Pinchart
<laurent.pinchart@ideasonboard.com> wrote:
>
> Hi Matwey,
>
> On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote:
> > 2018-07-30 18:35 GMT+03:00 Laurent Pinchart:
> > > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
> > >> 2018-07-23 21:57 GMT+03:00 Alan Stern:
> > >>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> > >>>> I've tried to strategies:
> > >>>>
> > >>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> > >>>> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP)
> > >>>
> > >>> Yes.
> > >>>
> > >>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map only
> > >>>> once at memory allocation)
> > >>>>
> > >>>> It is interesting that dma_unmap/dma_map pair leads to the lower
> > >>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform.
> > >>>> At armv7l platform using dma_unmap/dma_map  leads to ~50 usec in the
> > >>>> handler, and sync_cpu/sync_device - ~65 usec.
> > >>>>
> > >>>> However, I am not sure is it mandatory to call
> > >>>> dma_sync_single_for_device for FROM_DEVICE direction?
> > >>>
> > >>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not write
> > >>> to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is
> > >>> not needed.
> > >>
> > >> Well, I measured the following at armv7l. The handler execution time
> > >> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> > >>
> > >> 1) coherent DMA: ~3000 usec (pwc is not functional)
> > >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56
> > >> usec
> > >
> > > I really don't understand why the sync option is slower. Could you please
> > > investigate ? Before doing anything we need to make sure we have a full
> > > understanding of the problem.
> >
> > Hi,
> >
> > I've found one drawback in my measurements. I forgot to fix CPU
> > frequency at lowest state 300MHz. Now, I remeasured
> >
> > 2) dma_unmap and dma_map in the handler:
> > 2A) dma_unmap_single call: 28.8 +- 1.5 usec
> > 2B) memcpy and the rest: 58 +- 6 usec
> > 2C) dma_map_single call: 22 +- 2 usec
> > Total: 110 +- 7 usec
> >
> > 3) dma_sync_single_for_cpu
> > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
> > 3B) memcpy and the rest: 59 +- 6 usec
> > 3C) noop (trace events overhead): 5 +- 2 usec
> > Total: 93 +- 7 usec
> >
> > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
> > error ranges.
>
> Thank you for the time you've spent on these measurements, the information is
> useful and your work very appreciated.
>
> > >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There
> > >> is no performance improvement when dma_sync_single is used.
> > >>
> > >> At x86_64 the following happens:
> > >>
> > >> 1) coherent DMA: ~2 usec
> > >
> > > What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent()
> > > ? Could you trace it to see how memory is allocated exactly, and how it's
> > > mapped to the CPU ? I suspect that it will end up in dma_direct_alloc()
> > > but I'd like a confirmation.
> >
> > usb_alloc_coherents() ends up inside hcd_buffer_alloc() where
> > dma_alloc_coherent() is called. Keep in mind, that requested size is
> > 9560 in our case and pool is not used.
> >
> > >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
> > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4
> > >> usec
> > >>
> > >> So, whats to do next? Personally, I think that DMA streaming API
> > >> introduces not so great overhead.
> > >
> > > It might not be very large, but with USB3 cameras at high resolutions and
> > > framerates, it might still become noticeable. I wouldn't degrade
> > > performances on x86, especially if we can decide which option to use
> > > based on the platform (or perhaps even better based on Kconfig options
> > > such as DMA_NONCOHERENT).
> >
> > PWC is discontinued chip, so there will not be any new USB3 cameras.
>
> You're right. I had in mind other USB cameras that would benefit from the same
> change, and in particular the uvcvideo driver, which is used by USB3 cameras.
>
> > Kconfig won't work here, as I said before, DMA config is stored inside
> > device tree blob on ARM architecture.
>
> But couldn't we skip it at least on x86 ?

If we use the map-once, sync-repeatedly approach, would there be
anything to gain on x86? I believe the sync ops there would be
effectively no-ops, so the only overhead would be of a function call.

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer
  2018-08-09  2:36                               ` Tomasz Figa
@ 2018-08-09 10:28                                 ` Laurent Pinchart
  0 siblings, 0 replies; 38+ messages in thread
From: Laurent Pinchart @ 2018-08-09 10:28 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede,
	Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely,
	Bhumika Goyal, Colin King, Linux Media Mailing List,
	Linux Kernel Mailing List, Kieran Bingham, keiichiw

Hi Tomasz,

On Thursday, 9 August 2018 05:36:46 EEST Tomasz Figa wrote:
> On Thu, Aug 9, 2018 at 7:31 AM Laurent Pinchart wrote:
> > On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote:
> >> 2018-07-30 18:35 GMT+03:00 Laurent Pinchart:
> >>> On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote:
> >>>> 2018-07-23 21:57 GMT+03:00 Alan Stern:
> >>>>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote:
> >>>>>> I've tried to strategies:
> >>>>>> 
> >>>>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is
> >>>>>> similar to how USB core does when there is no
> >>>>>> URB_NO_TRANSFER_DMA_MAP)
> >>>>> 
> >>>>> Yes.
> >>>>> 
> >>>>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map
> >>>>>> only once at memory allocation)
> >>>>>> 
> >>>>>> It is interesting that dma_unmap/dma_map pair leads to the lower
> >>>>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64
> >>>>>> platform. At armv7l platform using dma_unmap/dma_map  leads to ~50
> >>>>>> usec in the handler, and sync_cpu/sync_device - ~65 usec.
> >>>>>> 
> >>>>>> However, I am not sure is it mandatory to call
> >>>>>> dma_sync_single_for_device for FROM_DEVICE direction?
> >>>>> 
> >>>>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not
> >>>>> write to a DMA_FROM_DEVICE-mapped area, so
> >>>>> dma_sync_single_for_device() is not needed.
> >>>> 
> >>>> Well, I measured the following at armv7l. The handler execution time
> >>>> (URB_NO_TRANSFER_DMA_MAP is used for all cases):
> >>>> 
> >>>> 1) coherent DMA: ~3000 usec (pwc is not functional)
> >>>> 2) explicit dma_unmap and dma_map in the handler: ~52 usec
> >>>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device):
> >>>> ~56 usec
> >>> 
> >>> I really don't understand why the sync option is slower. Could you
> >>> please investigate ? Before doing anything we need to make sure we have
> >>> a full understanding of the problem.
> >> 
> >> Hi,
> >> 
> >> I've found one drawback in my measurements. I forgot to fix CPU
> >> frequency at lowest state 300MHz. Now, I remeasured
> >> 
> >> 2) dma_unmap and dma_map in the handler:
> >> 2A) dma_unmap_single call: 28.8 +- 1.5 usec
> >> 2B) memcpy and the rest: 58 +- 6 usec
> >> 2C) dma_map_single call: 22 +- 2 usec
> >> Total: 110 +- 7 usec
> >> 
> >> 3) dma_sync_single_for_cpu
> >> 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec
> >> 3B) memcpy and the rest: 59 +- 6 usec
> >> 3C) noop (trace events overhead): 5 +- 2 usec
> >> Total: 93 +- 7 usec
> >> 
> >> So, now we see that 2A and 3A (as well as 2B and 3B) agree good within
> >> error ranges.
> > 
> > Thank you for the time you've spent on these measurements, the information
> > is useful and your work very appreciated.
> > 
> >>>> So, I suppose that unfortunately Tomasz suggestion doesn't work.
> >>>> There is no performance improvement when dma_sync_single is used.
> >>>> 
> >>>> At x86_64 the following happens:
> >>>> 
> >>>> 1) coherent DMA: ~2 usec
> >>> 
> >>> What do you mean by coherent DMA for x86_64 ? Is that
> >>> usb_alloc_coherent() ? Could you trace it to see how memory is allocated
> >>> exactly, and how it's mapped to the CPU ? I suspect that it will end up
> >>> in dma_direct_alloc() but I'd like a confirmation.
> >> 
> >> usb_alloc_coherents() ends up inside hcd_buffer_alloc() where
> >> dma_alloc_coherent() is called. Keep in mind, that requested size is
> >> 9560 in our case and pool is not used.
> >> 
> >>>> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec
> >>>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device):
> >>>> ~4 usec
> >>>> 
> >>>> So, whats to do next? Personally, I think that DMA streaming API
> >>>> introduces not so great overhead.
> >>> 
> >>> It might not be very large, but with USB3 cameras at high resolutions
> >>> and framerates, it might still become noticeable. I wouldn't degrade
> >>> performances on x86, especially if we can decide which option to use
> >>> based on the platform (or perhaps even better based on Kconfig options
> >>> such as DMA_NONCOHERENT).
> >> 
> >> PWC is discontinued chip, so there will not be any new USB3 cameras.
> > 
> > You're right. I had in mind other USB cameras that would benefit from the
> > same change, and in particular the uvcvideo driver, which is used by USB3
> > cameras.
> > 
> >> Kconfig won't work here, as I said before, DMA config is stored inside
> >> device tree blob on ARM architecture.
> > 
> > But couldn't we skip it at least on x86 ?
> 
> If we use the map-once, sync-repeatedly approach, would there be
> anything to gain on x86? I believe the sync ops there would be
> effectively no-ops, so the only overhead would be of a function call.

With that approach, and iff they're effectively no-ops, that should be fine. 
We thus need to double-check.

-- 
Regards,

Laurent Pinchart




^ permalink raw reply	[flat|nested] 38+ messages in thread

end of thread, other threads:[~2018-08-09 10:27 UTC | newest]

Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov
2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov
2018-06-18  5:11   ` Ezequiel Garcia
2018-06-18  7:10     ` Matwey V. Kornilov
2018-07-17 20:10       ` Ezequiel Garcia
2018-07-17 20:51         ` Alan Stern
2018-07-20 10:55           ` Tomasz Figa
2018-07-20 11:22             ` Matwey V. Kornilov
2018-07-20 11:33               ` Tomasz Figa
2018-07-20 11:57                 ` Matwey V. Kornilov
2018-07-23 17:04                   ` Matwey V. Kornilov
2018-07-23 18:57                     ` Alan Stern
2018-07-24 18:56                       ` Matwey V. Kornilov
2018-07-24 20:55                         ` Alan Stern
2018-07-25 13:46                           ` Matwey V. Kornilov
2018-07-30  4:14                             ` Tomasz Figa
2018-08-04  8:05                               ` Matwey V. Kornilov
2018-07-30 15:35                         ` Laurent Pinchart
2018-08-04  8:00                           ` Matwey V. Kornilov
2018-08-04 14:46                             ` Alan Stern
2018-08-05  7:49                               ` Christoph Hellwig
2018-08-05  8:33                                 ` Matwey V. Kornilov
2018-08-05  8:41                                   ` Christoph Hellwig
2018-08-08 22:32                             ` Laurent Pinchart
2018-08-09  2:36                               ` Tomasz Figa
2018-08-09 10:28                                 ` Laurent Pinchart
2018-07-30 15:13                 ` Laurent Pinchart
2018-07-18 12:10         ` Matwey V. Kornilov
2018-07-19 23:36           ` Ezequiel Garcia
2018-07-20  9:35             ` Matwey V. Kornilov
2018-07-30 15:42             ` Laurent Pinchart
2018-07-30 16:07         ` Mauro Carvalho Chehab
2018-07-31  6:06           ` Tomasz Figa
2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt
2018-06-19 16:23   ` Matwey V. Kornilov
2018-06-19 16:33     ` Steven Rostedt
2018-06-20  8:05       ` Matwey V. Kornilov
2018-06-20 13:09         ` Steven Rostedt

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.