* [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() @ 2018-06-17 14:36 Matwey V. Kornilov 2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov 2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt 0 siblings, 2 replies; 38+ messages in thread From: Matwey V. Kornilov @ 2018-06-17 14:36 UTC (permalink / raw) To: hverkuil, mchehab Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media, linux-kernel, Matwey V. Kornilov Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> --- drivers/media/usb/pwc/pwc-if.c | 7 +++++++ include/trace/events/pwc.h | 45 ++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 52 insertions(+) create mode 100644 include/trace/events/pwc.h diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c index 54b036d39c5b..5775d1f60668 100644 --- a/drivers/media/usb/pwc/pwc-if.c +++ b/drivers/media/usb/pwc/pwc-if.c @@ -57,6 +57,9 @@ - Pham Thanh Nam: webcam snapshot button as an event input device */ +#define CREATE_TRACE_POINTS +#include <trace/events/pwc.h> + #include <linux/errno.h> #include <linux/init.h> #include <linux/mm.h> @@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb) int i, fst, flen; unsigned char *iso_buf = NULL; + trace_pwc_handler_enter(urb); + if (urb->status == -ENOENT || urb->status == -ECONNRESET || urb->status == -ESHUTDOWN) { PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n", @@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb) pdev->vlast_packet_size = flen; } + trace_pwc_handler_exit(urb); + handler_end: i = usb_submit_urb(urb, GFP_ATOMIC); if (i != 0) diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h new file mode 100644 index 000000000000..b13d2118bb7a --- /dev/null +++ b/include/trace/events/pwc.h @@ -0,0 +1,45 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_PWC_H + +#include <linux/usb.h> +#include <linux/tracepoint.h> + +#undef TRACE_SYSTEM +#define TRACE_SYSTEM pwc + +TRACE_EVENT(pwc_handler_enter, + TP_PROTO(struct urb *urb), + TP_ARGS(urb), + TP_STRUCT__entry( + __field(struct urb*, urb) + __field(int, urb__status) + __field(u32, urb__actual_length) + ), + TP_fast_assign( + __entry->urb = urb; + __entry->urb__status = urb->status; + __entry->urb__actual_length = urb->actual_length; + ), + TP_printk("urb=%p (status=%d actual_length=%u)", + __entry->urb, + __entry->urb__status, + __entry->urb__actual_length) +); + +TRACE_EVENT(pwc_handler_exit, + TP_PROTO(struct urb *urb), + TP_ARGS(urb), + TP_STRUCT__entry( + __field(struct urb*, urb) + ), + TP_fast_assign( + __entry->urb = urb; + ), + TP_printk("urb=%p", __entry->urb) +); + +#endif /* _TRACE_PWC_H */ + +/* This part must be outside protection */ +#include <trace/define_trace.h> -- 2.16.0-rc1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov @ 2018-06-17 14:36 ` Matwey V. Kornilov 2018-06-18 5:11 ` Ezequiel Garcia 2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt 1 sibling, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-06-17 14:36 UTC (permalink / raw) To: hverkuil, mchehab Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media, linux-kernel, Matwey V. Kornilov DMA cocherency slows the transfer down on systems without hardware coherent DMA. Based on previous commit the following performance benchmarks have been carried out. Average memcpy() data transfer rate (rate) and handler completion time (time) have been measured when running video stream at 640x480 resolution at 10fps. x86_64 based system (Intel Core i5-3470). This platform has hardware coherent DMA support and proposed change doesn't make big difference here. * kmalloc: rate = (4.4 +- 1.0) GBps time = (2.4 +- 1.2) usec * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps time = (2.5 +- 1.0) usec We see that the measurements agree well within error ranges in this case. So no performance downgrade is introduced. armv7l based system (TI AM335x BeagleBone Black). This platform has no hardware coherent DMA support. DMA coherence is implemented via disabled page caching that slows down memcpy() due to memory controller behaviour. * kmalloc: rate = (190 +- 30) MBps time = (50 +- 10) usec * usb_alloc_coherent: rate = (33 +- 4) MBps time = (3000 +- 400) usec Note, that quantative difference leads (this commit leads to 5 times acceleration) to qualitative behavior change in this case. As it was stated before, the video stream can not be successfully received at AM335x platforms with MUSB based USB host controller due to performance issues [1]. [1] https://www.spinics.net/lists/linux-usb/msg165735.html Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> --- drivers/media/usb/pwc/pwc-if.c | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c index 5775d1f60668..6a3cd9680a7f 100644 --- a/drivers/media/usb/pwc/pwc-if.c +++ b/drivers/media/usb/pwc/pwc-if.c @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev) urb->interval = 1; // devik urb->dev = udev; urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint); - urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; - urb->transfer_buffer = usb_alloc_coherent(udev, - ISO_BUFFER_SIZE, - GFP_KERNEL, - &urb->transfer_dma); + urb->transfer_flags = URB_ISO_ASAP; + urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL); if (urb->transfer_buffer == NULL) { PWC_ERROR("Failed to allocate urb buffer %d\n", i); pwc_isoc_cleanup(pdev); @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev) if (pdev->urbs[i]) { PWC_DEBUG_MEMORY("Freeing URB\n"); if (pdev->urbs[i]->transfer_buffer) { - usb_free_coherent(pdev->udev, - pdev->urbs[i]->transfer_buffer_length, - pdev->urbs[i]->transfer_buffer, - pdev->urbs[i]->transfer_dma); + kfree(pdev->urbs[i]->transfer_buffer); } usb_free_urb(pdev->urbs[i]); pdev->urbs[i] = NULL; -- 2.16.0-rc1 ^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov @ 2018-06-18 5:11 ` Ezequiel Garcia 2018-06-18 7:10 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Ezequiel Garcia @ 2018-06-18 5:11 UTC (permalink / raw) To: Matwey V. Kornilov, hverkuil, mchehab, Laurent Pinchart Cc: rostedt, mingo, isely, bhumirks, colin.king, linux-media, linux-kernel + Laurent On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: > DMA cocherency slows the transfer down on systems without hardware > coherent DMA. > > Based on previous commit the following performance benchmarks have been > carried out. Average memcpy() data transfer rate (rate) and handler > completion time (time) have been measured when running video stream at > 640x480 resolution at 10fps. > > x86_64 based system (Intel Core i5-3470). This platform has hardware > coherent DMA support and proposed change doesn't make big difference here. > > * kmalloc: rate = (4.4 +- 1.0) GBps > time = (2.4 +- 1.2) usec > * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps > time = (2.5 +- 1.0) usec > > We see that the measurements agree well within error ranges in this case. > So no performance downgrade is introduced. > > armv7l based system (TI AM335x BeagleBone Black). This platform has no > hardware coherent DMA support. DMA coherence is implemented via disabled > page caching that slows down memcpy() due to memory controller behaviour. > > * kmalloc: rate = (190 +- 30) MBps > time = (50 +- 10) usec > * usb_alloc_coherent: rate = (33 +- 4) MBps > time = (3000 +- 400) usec > > Note, that quantative difference leads (this commit leads to 5 times > acceleration) to qualitative behavior change in this case. As it was > stated before, the video stream can not be successfully received at AM335x > platforms with MUSB based USB host controller due to performance issues > [1]. > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html > This is quite interesting! I have receive similar complaints from users wanting to use stk1160 on BBB and Raspberrys, without much luck on either, due to insufficient isoc bandwidth. I'm guessing other ARM platforms could be suffering from the same issue. Note that stk1160 and uvcvideo drivers use kmalloc on platforms where DMA_NONCOHERENT is defined, but this is not the case on ARM platforms. So, what is the benefit of using consistent for these URBs, as opposed to streaming? If the choice is simply platform dependent, can't we somehow detect which mapping should be prefered? > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> > --- > drivers/media/usb/pwc/pwc-if.c | 12 +++--------- > 1 file changed, 3 insertions(+), 9 deletions(-) > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c > index 5775d1f60668..6a3cd9680a7f 100644 > --- a/drivers/media/usb/pwc/pwc-if.c > +++ b/drivers/media/usb/pwc/pwc-if.c > @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev) > urb->interval = 1; // devik > urb->dev = udev; > urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint); > - urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; > - urb->transfer_buffer = usb_alloc_coherent(udev, > - ISO_BUFFER_SIZE, > - GFP_KERNEL, > - &urb->transfer_dma); > + urb->transfer_flags = URB_ISO_ASAP; > + urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL); > if (urb->transfer_buffer == NULL) { > PWC_ERROR("Failed to allocate urb buffer %d\n", i); > pwc_isoc_cleanup(pdev); > @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev) > if (pdev->urbs[i]) { > PWC_DEBUG_MEMORY("Freeing URB\n"); > if (pdev->urbs[i]->transfer_buffer) { > - usb_free_coherent(pdev->udev, > - pdev->urbs[i]->transfer_buffer_length, > - pdev->urbs[i]->transfer_buffer, > - pdev->urbs[i]->transfer_dma); > + kfree(pdev->urbs[i]->transfer_buffer); > } > usb_free_urb(pdev->urbs[i]); > pdev->urbs[i] = NULL; ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-06-18 5:11 ` Ezequiel Garcia @ 2018-06-18 7:10 ` Matwey V. Kornilov 2018-07-17 20:10 ` Ezequiel Garcia 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-06-18 7:10 UTC (permalink / raw) To: Ezequiel Garcia Cc: hverkuil, mchehab, Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king, linux-media, open list Hi Ezequiel, 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > + Laurent > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: >> DMA cocherency slows the transfer down on systems without hardware >> coherent DMA. >> >> Based on previous commit the following performance benchmarks have been >> carried out. Average memcpy() data transfer rate (rate) and handler >> completion time (time) have been measured when running video stream at >> 640x480 resolution at 10fps. >> >> x86_64 based system (Intel Core i5-3470). This platform has hardware >> coherent DMA support and proposed change doesn't make big difference here. >> >> * kmalloc: rate = (4.4 +- 1.0) GBps >> time = (2.4 +- 1.2) usec >> * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps >> time = (2.5 +- 1.0) usec >> >> We see that the measurements agree well within error ranges in this case. >> So no performance downgrade is introduced. >> >> armv7l based system (TI AM335x BeagleBone Black). This platform has no >> hardware coherent DMA support. DMA coherence is implemented via disabled >> page caching that slows down memcpy() due to memory controller behaviour. >> >> * kmalloc: rate = (190 +- 30) MBps >> time = (50 +- 10) usec >> * usb_alloc_coherent: rate = (33 +- 4) MBps >> time = (3000 +- 400) usec >> >> Note, that quantative difference leads (this commit leads to 5 times >> acceleration) to qualitative behavior change in this case. As it was >> stated before, the video stream can not be successfully received at AM335x >> platforms with MUSB based USB host controller due to performance issues >> [1]. >> >> [1] https://www.spinics.net/lists/linux-usb/msg165735.html >> > > This is quite interesting! I have receive similar complaints > from users wanting to use stk1160 on BBB and Raspberrys, > without much luck on either, due to insufficient isoc bandwidth. > > I'm guessing other ARM platforms could be suffering > from the same issue. > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms > where DMA_NONCOHERENT is defined, but this is not the case > on ARM platforms. There are some ARMv7 platforms that have coherent DMA (for instance Broadcome Horthstar Plus series), but the most of them don't have. It is defined in device tree file, and there is no way to recover this information at runtime in USB perepherial driver. > > So, what is the benefit of using consistent > for these URBs, as opposed to streaming? I don't know, I think there is no real benefit and all we see is a consequence of copy-pasta when some webcam drivers were inspired by others and development priparily was going at x86 platforms. It would be great if somebody corrected me here. DMA Coherence is quite strong property and I cannot figure out how can it help when streaming video. The CPU host always reads from the buffer and never writes to. Hardware perepherial always writes to and never reads from. Moreover, buffer access is mutually exclusive and separated in time by Interrupt fireing and URB starting (when we reuse existing URB for new request). Only single one memory barrier is really required here. I understand that there are cases when DMA coherence is really needed, for instane VirtIO VRing when we accessing same data structure in both directions from the both sides, but this has nothing common with our case. > > If the choice is simply platform dependent, > can't we somehow detect which mapping should > be prefered? Now, we don't have this way. > >> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> >> --- >> drivers/media/usb/pwc/pwc-if.c | 12 +++--------- >> 1 file changed, 3 insertions(+), 9 deletions(-) >> >> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c >> index 5775d1f60668..6a3cd9680a7f 100644 >> --- a/drivers/media/usb/pwc/pwc-if.c >> +++ b/drivers/media/usb/pwc/pwc-if.c >> @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev) >> urb->interval = 1; // devik >> urb->dev = udev; >> urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint); >> - urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; >> - urb->transfer_buffer = usb_alloc_coherent(udev, >> - ISO_BUFFER_SIZE, >> - GFP_KERNEL, >> - &urb->transfer_dma); >> + urb->transfer_flags = URB_ISO_ASAP; >> + urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL); >> if (urb->transfer_buffer == NULL) { >> PWC_ERROR("Failed to allocate urb buffer %d\n", i); >> pwc_isoc_cleanup(pdev); >> @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev) >> if (pdev->urbs[i]) { >> PWC_DEBUG_MEMORY("Freeing URB\n"); >> if (pdev->urbs[i]->transfer_buffer) { >> - usb_free_coherent(pdev->udev, >> - pdev->urbs[i]->transfer_buffer_length, >> - pdev->urbs[i]->transfer_buffer, >> - pdev->urbs[i]->transfer_dma); >> + kfree(pdev->urbs[i]->transfer_buffer); >> } >> usb_free_urb(pdev->urbs[i]); >> pdev->urbs[i] = NULL; > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-06-18 7:10 ` Matwey V. Kornilov @ 2018-07-17 20:10 ` Ezequiel Garcia 2018-07-17 20:51 ` Alan Stern ` (2 more replies) 0 siblings, 3 replies; 38+ messages in thread From: Ezequiel Garcia @ 2018-07-17 20:10 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king, linux-media, open list Hi Matwey, First of all, sorry for the delay. Adding Alan and Hans. Guys, do you have any feedback here? See below for some feedback on my side. On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote: > Hi Ezequiel, > > 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > > + Laurent > > > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: > > > DMA cocherency slows the transfer down on systems without hardware > > > coherent DMA. > > > > > > Based on previous commit the following performance benchmarks have been > > > carried out. Average memcpy() data transfer rate (rate) and handler > > > completion time (time) have been measured when running video stream at > > > 640x480 resolution at 10fps. > > > > > > x86_64 based system (Intel Core i5-3470). This platform has hardware > > > coherent DMA support and proposed change doesn't make big difference here. > > > > > > * kmalloc: rate = (4.4 +- 1.0) GBps > > > time = (2.4 +- 1.2) usec > > > * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps > > > time = (2.5 +- 1.0) usec > > > > > > We see that the measurements agree well within error ranges in this case. > > > So no performance downgrade is introduced. > > > > > > armv7l based system (TI AM335x BeagleBone Black). This platform has no > > > hardware coherent DMA support. DMA coherence is implemented via disabled > > > page caching that slows down memcpy() due to memory controller behaviour. > > > > > > * kmalloc: rate = (190 +- 30) MBps > > > time = (50 +- 10) usec > > > * usb_alloc_coherent: rate = (33 +- 4) MBps > > > time = (3000 +- 400) usec > > > > > > Note, that quantative difference leads (this commit leads to 5 times > > > acceleration) to qualitative behavior change in this case. As it was > > > stated before, the video stream can not be successfully received at AM335x > > > platforms with MUSB based USB host controller due to performance issues > > > [1]. > > > > > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html > > > > > > > This is quite interesting! I have receive similar complaints > > from users wanting to use stk1160 on BBB and Raspberrys, > > without much luck on either, due to insufficient isoc bandwidth. > > > > I'm guessing other ARM platforms could be suffering > > from the same issue. > > > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms > > where DMA_NONCOHERENT is defined, but this is not the case > > on ARM platforms. > > There are some ARMv7 platforms that have coherent DMA (for instance > Broadcome Horthstar Plus series), but the most of them don't have. It > is defined in device tree file, and there is no way to recover this > information at runtime in USB perepherial driver. > > > > > So, what is the benefit of using consistent > > for these URBs, as opposed to streaming? > > I don't know, I think there is no real benefit and all we see is a > consequence of copy-pasta when some webcam drivers were inspired by > others and development priparily was going at x86 platforms. You are probably right about the copy-pasta. > It would > be great if somebody corrected me here. DMA Coherence is quite strong > property and I cannot figure out how can it help when streaming video. > The CPU host always reads from the buffer and never writes to. > Hardware perepherial always writes to and never reads from. Moreover, > buffer access is mutually exclusive and separated in time by Interrupt > fireing and URB starting (when we reuse existing URB for new request). > Only single one memory barrier is really required here. > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core create DMA mappings and use the streaming API. Which makes more sense in hardware without hardware coherency. The only thing that bothers me with this patch is that it's not really something specific to this driver. If this fix is valid for pwc, then it's valid for all the drivers allocating coherent memory. And also, this path won't prevent further copy-paste spread of the coherent allocation. Is there any chance we can introduce a helper to allocate isoc URBs, and then change all drivers to use it? No need to do all of them now, but it would be good to at least have a plan for it. Chances are, only a handful of platforms would care to use coherent mappings. > I understand that there are cases when DMA coherence is really needed, > for instane VirtIO VRing when we accessing same data structure in both > directions from the both sides, but this has nothing common with our > case. > > > > > If the choice is simply platform dependent, > > can't we somehow detect which mapping should > > be prefered? > > Now, we don't have this way. > > > > > > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> > > > --- > > > drivers/media/usb/pwc/pwc-if.c | 12 +++--------- > > > 1 file changed, 3 insertions(+), 9 deletions(-) > > > > > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c > > > index 5775d1f60668..6a3cd9680a7f 100644 > > > --- a/drivers/media/usb/pwc/pwc-if.c > > > +++ b/drivers/media/usb/pwc/pwc-if.c > > > @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev) > > > urb->interval = 1; // devik > > > urb->dev = udev; > > > urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint); > > > - urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; > > > - urb->transfer_buffer = usb_alloc_coherent(udev, > > > - ISO_BUFFER_SIZE, > > > - GFP_KERNEL, > > > - &urb->transfer_dma); > > > + urb->transfer_flags = URB_ISO_ASAP; > > > + urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL); > > > if (urb->transfer_buffer == NULL) { > > > PWC_ERROR("Failed to allocate urb buffer %d\n", i); > > > pwc_isoc_cleanup(pdev); > > > @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev) > > > if (pdev->urbs[i]) { > > > PWC_DEBUG_MEMORY("Freeing URB\n"); > > > if (pdev->urbs[i]->transfer_buffer) { > > > - usb_free_coherent(pdev->udev, > > > - pdev->urbs[i]->transfer_buffer_length, > > > - pdev->urbs[i]->transfer_buffer, > > > - pdev->urbs[i]->transfer_dma); > > > + kfree(pdev->urbs[i]->transfer_buffer); > > > } > > > usb_free_urb(pdev->urbs[i]); > > > pdev->urbs[i] = NULL; > > > ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-17 20:10 ` Ezequiel Garcia @ 2018-07-17 20:51 ` Alan Stern 2018-07-20 10:55 ` Tomasz Figa 2018-07-18 12:10 ` Matwey V. Kornilov 2018-07-30 16:07 ` Mauro Carvalho Chehab 2 siblings, 1 reply; 38+ messages in thread From: Alan Stern @ 2018-07-17 20:51 UTC (permalink / raw) To: Ezequiel Garcia Cc: Matwey V. Kornilov, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king, linux-media, open list On Tue, 17 Jul 2018, Ezequiel Garcia wrote: > Hi Matwey, > > First of all, sorry for the delay. > > Adding Alan and Hans. Guys, do you have any feedback here? ... > > > So, what is the benefit of using consistent > > > for these URBs, as opposed to streaming? > > > > I don't know, I think there is no real benefit and all we see is a > > consequence of copy-pasta when some webcam drivers were inspired by > > others and development priparily was going at x86 platforms. > > You are probably right about the copy-pasta. > > > It would > > be great if somebody corrected me here. DMA Coherence is quite strong > > property and I cannot figure out how can it help when streaming video. > > The CPU host always reads from the buffer and never writes to. > > Hardware perepherial always writes to and never reads from. Moreover, > > buffer access is mutually exclusive and separated in time by Interrupt > > fireing and URB starting (when we reuse existing URB for new request). > > Only single one memory barrier is really required here. > > > > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > create DMA mappings and use the streaming API. Which makes more > sense in hardware without hardware coherency. As far as I know, the _only_ advantage to using coherent DMA in this situation is that you then do not have to pay the overhead of constantly setting up and tearing down the streaming mappings. So it depends very much on the platform: If coherent buffers are cached then it's a slight win and otherwise it's a big lose. Alan Stern ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-17 20:51 ` Alan Stern @ 2018-07-20 10:55 ` Tomasz Figa 2018-07-20 11:22 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Tomasz Figa @ 2018-07-20 10:55 UTC (permalink / raw) To: stern Cc: Ezequiel Garcia, matwey, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king, Linux Media Mailing List, Linux Kernel Mailing List On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote: > > On Tue, 17 Jul 2018, Ezequiel Garcia wrote: > > > Hi Matwey, > > > > First of all, sorry for the delay. > > > > Adding Alan and Hans. Guys, do you have any feedback here? > > ... > > > > > So, what is the benefit of using consistent > > > > for these URBs, as opposed to streaming? > > > > > > I don't know, I think there is no real benefit and all we see is a > > > consequence of copy-pasta when some webcam drivers were inspired by > > > others and development priparily was going at x86 platforms. > > > > You are probably right about the copy-pasta. > > > > > It would > > > be great if somebody corrected me here. DMA Coherence is quite strong > > > property and I cannot figure out how can it help when streaming video. > > > The CPU host always reads from the buffer and never writes to. > > > Hardware perepherial always writes to and never reads from. Moreover, > > > buffer access is mutually exclusive and separated in time by Interrupt > > > fireing and URB starting (when we reuse existing URB for new request). > > > Only single one memory barrier is really required here. > > > > > > > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > > create DMA mappings and use the streaming API. Which makes more > > sense in hardware without hardware coherency. > > As far as I know, the _only_ advantage to using coherent DMA in this > situation is that you then do not have to pay the overhead of > constantly setting up and tearing down the streaming mappings. So it > depends very much on the platform: If coherent buffers are cached then > it's a slight win and otherwise it's a big lose. Isn't it about usb_alloc_coherent() being backed by DMA coherent API (dma_alloc_coherent/attrs()) and ending up allocating uncached (or write-combine) memory for devices with non-coherent DMAs? I'm not sure how this memory is used by this driver, but if it reads from it using CPU, uncached mapping might be significantly slower than cached mapping of memory allocated with kmalloc(). Best regards, Tomasz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-20 10:55 ` Tomasz Figa @ 2018-07-20 11:22 ` Matwey V. Kornilov 2018-07-20 11:33 ` Tomasz Figa 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-20 11:22 UTC (permalink / raw) To: Tomasz Figa Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote: >> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote: >> >> > Hi Matwey, >> > >> > First of all, sorry for the delay. >> > >> > Adding Alan and Hans. Guys, do you have any feedback here? >> >> ... >> >> > > > So, what is the benefit of using consistent >> > > > for these URBs, as opposed to streaming? >> > > >> > > I don't know, I think there is no real benefit and all we see is a >> > > consequence of copy-pasta when some webcam drivers were inspired by >> > > others and development priparily was going at x86 platforms. >> > >> > You are probably right about the copy-pasta. >> > >> > > It would >> > > be great if somebody corrected me here. DMA Coherence is quite strong >> > > property and I cannot figure out how can it help when streaming video. >> > > The CPU host always reads from the buffer and never writes to. >> > > Hardware perepherial always writes to and never reads from. Moreover, >> > > buffer access is mutually exclusive and separated in time by Interrupt >> > > fireing and URB starting (when we reuse existing URB for new request). >> > > Only single one memory barrier is really required here. >> > > >> > >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core >> > create DMA mappings and use the streaming API. Which makes more >> > sense in hardware without hardware coherency. >> >> As far as I know, the _only_ advantage to using coherent DMA in this >> situation is that you then do not have to pay the overhead of >> constantly setting up and tearing down the streaming mappings. So it >> depends very much on the platform: If coherent buffers are cached then >> it's a slight win and otherwise it's a big lose. > > Isn't it about usb_alloc_coherent() being backed by DMA coherent API > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or > write-combine) memory for devices with non-coherent DMAs? I'm not sure Yes, this is what exactly happens at armv7l platforms. > how this memory is used by this driver, but if it reads from it using > CPU, uncached mapping might be significantly slower than cached > mapping of memory allocated with kmalloc(). > > Best regards, > Tomasz > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-20 11:22 ` Matwey V. Kornilov @ 2018-07-20 11:33 ` Tomasz Figa 2018-07-20 11:57 ` Matwey V. Kornilov 2018-07-30 15:13 ` Laurent Pinchart 0 siblings, 2 replies; 38+ messages in thread From: Tomasz Figa @ 2018-07-20 11:33 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote: > > 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: > > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote: > >> > >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote: > >> > >> > Hi Matwey, > >> > > >> > First of all, sorry for the delay. > >> > > >> > Adding Alan and Hans. Guys, do you have any feedback here? > >> > >> ... > >> > >> > > > So, what is the benefit of using consistent > >> > > > for these URBs, as opposed to streaming? > >> > > > >> > > I don't know, I think there is no real benefit and all we see is a > >> > > consequence of copy-pasta when some webcam drivers were inspired by > >> > > others and development priparily was going at x86 platforms. > >> > > >> > You are probably right about the copy-pasta. > >> > > >> > > It would > >> > > be great if somebody corrected me here. DMA Coherence is quite strong > >> > > property and I cannot figure out how can it help when streaming video. > >> > > The CPU host always reads from the buffer and never writes to. > >> > > Hardware perepherial always writes to and never reads from. Moreover, > >> > > buffer access is mutually exclusive and separated in time by Interrupt > >> > > fireing and URB starting (when we reuse existing URB for new request). > >> > > Only single one memory barrier is really required here. > >> > > > >> > > >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > >> > create DMA mappings and use the streaming API. Which makes more > >> > sense in hardware without hardware coherency. > >> > >> As far as I know, the _only_ advantage to using coherent DMA in this > >> situation is that you then do not have to pay the overhead of > >> constantly setting up and tearing down the streaming mappings. So it > >> depends very much on the platform: If coherent buffers are cached then > >> it's a slight win and otherwise it's a big lose. > > > > Isn't it about usb_alloc_coherent() being backed by DMA coherent API > > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or > > write-combine) memory for devices with non-coherent DMAs? I'm not sure > > Yes, this is what exactly happens at armv7l platforms. Okay, thanks. So this seems to be exactly the same thing that is happening in the UVC driver. There is quite a bit of random accesses to extract some header fields and then a big memcpy into VB2 buffer to assemble final frame. If we don't want to pay the cost of creating and destroying the streaming mapping, we could map (dma_map_single()) once, set transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just synchronize the caches (dma_sync_single()) before submitting/after completing the URB. Best regards, Tomasz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-20 11:33 ` Tomasz Figa @ 2018-07-20 11:57 ` Matwey V. Kornilov 2018-07-23 17:04 ` Matwey V. Kornilov 2018-07-30 15:13 ` Laurent Pinchart 1 sibling, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-20 11:57 UTC (permalink / raw) To: Tomasz Figa Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-20 14:33 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: > On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote: >> >> 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: >> > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote: >> >> >> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote: >> >> >> >> > Hi Matwey, >> >> > >> >> > First of all, sorry for the delay. >> >> > >> >> > Adding Alan and Hans. Guys, do you have any feedback here? >> >> >> >> ... >> >> >> >> > > > So, what is the benefit of using consistent >> >> > > > for these URBs, as opposed to streaming? >> >> > > >> >> > > I don't know, I think there is no real benefit and all we see is a >> >> > > consequence of copy-pasta when some webcam drivers were inspired by >> >> > > others and development priparily was going at x86 platforms. >> >> > >> >> > You are probably right about the copy-pasta. >> >> > >> >> > > It would >> >> > > be great if somebody corrected me here. DMA Coherence is quite strong >> >> > > property and I cannot figure out how can it help when streaming video. >> >> > > The CPU host always reads from the buffer and never writes to. >> >> > > Hardware perepherial always writes to and never reads from. Moreover, >> >> > > buffer access is mutually exclusive and separated in time by Interrupt >> >> > > fireing and URB starting (when we reuse existing URB for new request). >> >> > > Only single one memory barrier is really required here. >> >> > > >> >> > >> >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core >> >> > create DMA mappings and use the streaming API. Which makes more >> >> > sense in hardware without hardware coherency. >> >> >> >> As far as I know, the _only_ advantage to using coherent DMA in this >> >> situation is that you then do not have to pay the overhead of >> >> constantly setting up and tearing down the streaming mappings. So it >> >> depends very much on the platform: If coherent buffers are cached then >> >> it's a slight win and otherwise it's a big lose. >> > >> > Isn't it about usb_alloc_coherent() being backed by DMA coherent API >> > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or >> > write-combine) memory for devices with non-coherent DMAs? I'm not sure >> >> Yes, this is what exactly happens at armv7l platforms. > > Okay, thanks. So this seems to be exactly the same thing that is > happening in the UVC driver. There is quite a bit of random accesses > to extract some header fields and then a big memcpy into VB2 buffer to > assemble final frame. > > If we don't want to pay the cost of creating and destroying the > streaming mapping, we could map (dma_map_single()) once, set > transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just > synchronize the caches (dma_sync_single()) before submitting/after > completing the URB. Thank you. Now it is becoming clearer to me. Please allow me some time to try to sketch the implementation. > > Best regards, > Tomasz > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-20 11:57 ` Matwey V. Kornilov @ 2018-07-23 17:04 ` Matwey V. Kornilov 2018-07-23 18:57 ` Alan Stern 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-23 17:04 UTC (permalink / raw) To: Tomasz Figa Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-20 14:57 GMT+03:00 Matwey V. Kornilov <matwey@sai.msu.ru>: > 2018-07-20 14:33 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: >> On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote: >>> >>> 2018-07-20 13:55 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: >>> > On Wed, Jul 18, 2018 at 5:51 AM Alan Stern <stern@rowland.harvard.edu> wrote: >>> >> >>> >> On Tue, 17 Jul 2018, Ezequiel Garcia wrote: >>> >> >>> >> > Hi Matwey, >>> >> > >>> >> > First of all, sorry for the delay. >>> >> > >>> >> > Adding Alan and Hans. Guys, do you have any feedback here? >>> >> >>> >> ... >>> >> >>> >> > > > So, what is the benefit of using consistent >>> >> > > > for these URBs, as opposed to streaming? >>> >> > > >>> >> > > I don't know, I think there is no real benefit and all we see is a >>> >> > > consequence of copy-pasta when some webcam drivers were inspired by >>> >> > > others and development priparily was going at x86 platforms. >>> >> > >>> >> > You are probably right about the copy-pasta. >>> >> > >>> >> > > It would >>> >> > > be great if somebody corrected me here. DMA Coherence is quite strong >>> >> > > property and I cannot figure out how can it help when streaming video. >>> >> > > The CPU host always reads from the buffer and never writes to. >>> >> > > Hardware perepherial always writes to and never reads from. Moreover, >>> >> > > buffer access is mutually exclusive and separated in time by Interrupt >>> >> > > fireing and URB starting (when we reuse existing URB for new request). >>> >> > > Only single one memory barrier is really required here. >>> >> > > >>> >> > >>> >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core >>> >> > create DMA mappings and use the streaming API. Which makes more >>> >> > sense in hardware without hardware coherency. >>> >> >>> >> As far as I know, the _only_ advantage to using coherent DMA in this >>> >> situation is that you then do not have to pay the overhead of >>> >> constantly setting up and tearing down the streaming mappings. So it >>> >> depends very much on the platform: If coherent buffers are cached then >>> >> it's a slight win and otherwise it's a big lose. >>> > >>> > Isn't it about usb_alloc_coherent() being backed by DMA coherent API >>> > (dma_alloc_coherent/attrs()) and ending up allocating uncached (or >>> > write-combine) memory for devices with non-coherent DMAs? I'm not sure >>> >>> Yes, this is what exactly happens at armv7l platforms. >> >> Okay, thanks. So this seems to be exactly the same thing that is >> happening in the UVC driver. There is quite a bit of random accesses >> to extract some header fields and then a big memcpy into VB2 buffer to >> assemble final frame. >> >> If we don't want to pay the cost of creating and destroying the >> streaming mapping, we could map (dma_map_single()) once, set >> transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just >> synchronize the caches (dma_sync_single()) before submitting/after >> completing the URB. > > Thank you. Now it is becoming clearer to me. Please allow me some time > to try to sketch the implementation. I've tried to strategies: 1) Use dma_unmap and dma_map inside the handler (I suppose this is similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) 2) Use sync_cpu and sync_device inside the handler (and dma_map only once at memory allocation) It is interesting that dma_unmap/dma_map pair leads to the lower overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the handler, and sync_cpu/sync_device - ~65 usec. However, I am not sure is it mandatory to call dma_sync_single_for_device for FROM_DEVICE direction? > >> >> Best regards, >> Tomasz >> > > > > -- > With best regards, > Matwey V. Kornilov. > Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia > 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-23 17:04 ` Matwey V. Kornilov @ 2018-07-23 18:57 ` Alan Stern 2018-07-24 18:56 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Alan Stern @ 2018-07-23 18:57 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > I've tried to strategies: > > 1) Use dma_unmap and dma_map inside the handler (I suppose this is > similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) Yes. > 2) Use sync_cpu and sync_device inside the handler (and dma_map only > once at memory allocation) > > It is interesting that dma_unmap/dma_map pair leads to the lower > overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > handler, and sync_cpu/sync_device - ~65 usec. > > However, I am not sure is it mandatory to call > dma_sync_single_for_device for FROM_DEVICE direction? According to Documentation/DMA-API-HOWTO.txt, the CPU should not write to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is not needed. Alan Stern ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-23 18:57 ` Alan Stern @ 2018-07-24 18:56 ` Matwey V. Kornilov 2018-07-24 20:55 ` Alan Stern 2018-07-30 15:35 ` Laurent Pinchart 0 siblings, 2 replies; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-24 18:56 UTC (permalink / raw) To: Alan Stern Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >> I've tried to strategies: >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > > Yes. > >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only >> once at memory allocation) >> >> It is interesting that dma_unmap/dma_map pair leads to the lower >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the >> handler, and sync_cpu/sync_device - ~65 usec. >> >> However, I am not sure is it mandatory to call >> dma_sync_single_for_device for FROM_DEVICE direction? > > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > not needed. Well, I measured the following at armv7l. The handler execution time (URB_NO_TRANSFER_DMA_MAP is used for all cases): 1) coherent DMA: ~3000 usec (pwc is not functional) 2) explicit dma_unmap and dma_map in the handler: ~52 usec 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec So, I suppose that unfortunately Tomasz suggestion doesn't work. There is no performance improvement when dma_sync_single is used. At x86_64 the following happens: 1) coherent DMA: ~2 usec 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec So, whats to do next? Personally, I think that DMA streaming API introduces not so great overhead. Does anybody happy with turning to streaming DMA or I'll introduce module-level switch as Ezequiel suggested? > > Alan Stern > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-24 18:56 ` Matwey V. Kornilov @ 2018-07-24 20:55 ` Alan Stern 2018-07-25 13:46 ` Matwey V. Kornilov 2018-07-30 15:35 ` Laurent Pinchart 1 sibling, 1 reply; 38+ messages in thread From: Alan Stern @ 2018-07-24 20:55 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Tue, 24 Jul 2018, Matwey V. Kornilov wrote: > 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: > > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > > > >> I've tried to strategies: > >> > >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > > > > Yes. > > > >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > >> once at memory allocation) > >> > >> It is interesting that dma_unmap/dma_map pair leads to the lower > >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > >> handler, and sync_cpu/sync_device - ~65 usec. > >> > >> However, I am not sure is it mandatory to call > >> dma_sync_single_for_device for FROM_DEVICE direction? > > > > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > > not needed. > > Well, I measured the following at armv7l. The handler execution time > (URB_NO_TRANSFER_DMA_MAP is used for all cases): > > 1) coherent DMA: ~3000 usec (pwc is not functional) > 2) explicit dma_unmap and dma_map in the handler: ~52 usec > 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec > > So, I suppose that unfortunately Tomasz suggestion doesn't work. There > is no performance improvement when dma_sync_single is used. > > At x86_64 the following happens: > > 1) coherent DMA: ~2 usec > 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec > 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec > > So, whats to do next? Personally, I think that DMA streaming API > introduces not so great overhead. > Does anybody happy with turning to streaming DMA or I'll introduce > module-level switch as Ezequiel suggested? How about using the dma_unmap and dma_map calls in the USB core? If they add the same overhead as putting them in the handler, I think it would be acceptable for x86_64. It certainly is odd that the dma_sync_single APIs take longer than simply mapping and unmapping. Alan Stern ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-24 20:55 ` Alan Stern @ 2018-07-25 13:46 ` Matwey V. Kornilov 2018-07-30 4:14 ` Tomasz Figa 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-25 13:46 UTC (permalink / raw) To: Alan Stern Cc: Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: > On Tue, 24 Jul 2018, Matwey V. Kornilov wrote: > >> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: >> > >> >> I've tried to strategies: >> >> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) >> > >> > Yes. >> > >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only >> >> once at memory allocation) >> >> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. >> >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the >> >> handler, and sync_cpu/sync_device - ~65 usec. >> >> >> >> However, I am not sure is it mandatory to call >> >> dma_sync_single_for_device for FROM_DEVICE direction? >> > >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is >> > not needed. >> >> Well, I measured the following at armv7l. The handler execution time >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): >> >> 1) coherent DMA: ~3000 usec (pwc is not functional) >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec >> >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There >> is no performance improvement when dma_sync_single is used. >> >> At x86_64 the following happens: >> >> 1) coherent DMA: ~2 usec >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec >> >> So, whats to do next? Personally, I think that DMA streaming API >> introduces not so great overhead. >> Does anybody happy with turning to streaming DMA or I'll introduce >> module-level switch as Ezequiel suggested? > > How about using the dma_unmap and dma_map calls in the USB core? If > they add the same overhead as putting them in the handler, I think it > would be acceptable for x86_64. Sure, that is the simplest way to implement streaming API. > > It certainly is odd that the dma_sync_single APIs take longer than > simply mapping and unmapping. Well. I am surprised too. Probably, it is related to that only 9560 bytes are used for each URB. It is only three memory pages. > > Alan Stern > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-25 13:46 ` Matwey V. Kornilov @ 2018-07-30 4:14 ` Tomasz Figa 2018-08-04 8:05 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Tomasz Figa @ 2018-07-30 4:14 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Wed, Jul 25, 2018 at 10:47 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote: > > 2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: > > On Tue, 24 Jul 2018, Matwey V. Kornilov wrote: > > > >> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: > >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >> > > >> >> I've tried to strategies: > >> >> > >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > >> > > >> > Yes. > >> > > >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > >> >> once at memory allocation) > >> >> > >> >> It is interesting that dma_unmap/dma_map pair leads to the lower > >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > >> >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > >> >> handler, and sync_cpu/sync_device - ~65 usec. > >> >> > >> >> However, I am not sure is it mandatory to call > >> >> dma_sync_single_for_device for FROM_DEVICE direction? > >> > > >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > >> > not needed. > >> > >> Well, I measured the following at armv7l. The handler execution time > >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): > >> > >> 1) coherent DMA: ~3000 usec (pwc is not functional) > >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec That's very strange because on ARM dma_unmap_single() does exactly the same thing as dma_sync_single_for_cpu(): arm_dma_map_page() https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L129 arm_dma_unmap_page() https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L159 arm_dma_sync_single_for_cpu() https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L167 Could you post the code you used for testing of cases 2) and 3)? Best regards, Tomasz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-30 4:14 ` Tomasz Figa @ 2018-08-04 8:05 ` Matwey V. Kornilov 0 siblings, 0 replies; 38+ messages in thread From: Matwey V. Kornilov @ 2018-08-04 8:05 UTC (permalink / raw) To: Tomasz Figa Cc: Alan Stern, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-30 7:14 GMT+03:00 Tomasz Figa <tfiga@chromium.org>: > On Wed, Jul 25, 2018 at 10:47 PM Matwey V. Kornilov <matwey@sai.msu.ru> wrote: >> >> 2018-07-24 23:55 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: >> > On Tue, 24 Jul 2018, Matwey V. Kornilov wrote: >> > >> >> 2018-07-23 21:57 GMT+03:00 Alan Stern <stern@rowland.harvard.edu>: >> >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: >> >> > >> >> >> I've tried to strategies: >> >> >> >> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is >> >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) >> >> > >> >> > Yes. >> >> > >> >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only >> >> >> once at memory allocation) >> >> >> >> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower >> >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. >> >> >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the >> >> >> handler, and sync_cpu/sync_device - ~65 usec. >> >> >> >> >> >> However, I am not sure is it mandatory to call >> >> >> dma_sync_single_for_device for FROM_DEVICE direction? >> >> > >> >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write >> >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is >> >> > not needed. >> >> >> >> Well, I measured the following at armv7l. The handler execution time >> >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): >> >> >> >> 1) coherent DMA: ~3000 usec (pwc is not functional) >> >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec >> >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 usec > > That's very strange because on ARM dma_unmap_single() does exactly the > same thing as dma_sync_single_for_cpu(): > > arm_dma_map_page() > https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L129 > arm_dma_unmap_page() > https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L159 > > arm_dma_sync_single_for_cpu() > https://elixir.bootlin.com/linux/v4.18-rc7/source/arch/arm/mm/dma-mapping.c#L167 > > Could you post the code you used for testing of cases 2) and 3)? I remeasured the timings (see my other mail). The code you asked is here: https://github.com/matwey/linux/tree/pwc_dma_v3 but I am going to rebase this branch. > > Best regards, > Tomasz > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-24 18:56 ` Matwey V. Kornilov 2018-07-24 20:55 ` Alan Stern @ 2018-07-30 15:35 ` Laurent Pinchart 2018-08-04 8:00 ` Matwey V. Kornilov 1 sibling, 1 reply; 38+ messages in thread From: Laurent Pinchart @ 2018-07-30 15:35 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw Hi Matwey, On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: > 2018-07-23 21:57 GMT+03:00 Alan Stern: > > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >> I've tried to strategies: > >> > >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > > > > Yes. > > > >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > >> once at memory allocation) > >> > >> It is interesting that dma_unmap/dma_map pair leads to the lower > >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > >> handler, and sync_cpu/sync_device - ~65 usec. > >> > >> However, I am not sure is it mandatory to call > >> dma_sync_single_for_device for FROM_DEVICE direction? > > > > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > > not needed. > > Well, I measured the following at armv7l. The handler execution time > (URB_NO_TRANSFER_DMA_MAP is used for all cases): > > 1) coherent DMA: ~3000 usec (pwc is not functional) > 2) explicit dma_unmap and dma_map in the handler: ~52 usec > 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 > usec I really don't understand why the sync option is slower. Could you please investigate ? Before doing anything we need to make sure we have a full understanding of the problem. > So, I suppose that unfortunately Tomasz suggestion doesn't work. There > is no performance improvement when dma_sync_single is used. > > At x86_64 the following happens: > > 1) coherent DMA: ~2 usec What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() ? Could you trace it to see how memory is allocated exactly, and how it's mapped to the CPU ? I suspect that it will end up in dma_direct_alloc() but I'd like a confirmation. > 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec > 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec > > So, whats to do next? Personally, I think that DMA streaming API > introduces not so great overhead. It might not be very large, but with USB3 cameras at high resolutions and framerates, it might still become noticeable. I wouldn't degrade performances on x86, especially if we can decide which option to use based on the platform (or perhaps even better based on Kconfig options such as DMA_NONCOHERENT). > Does anybody happy with turning to streaming DMA or I'll introduce > module-level switch as Ezequiel suggested? A module-level switch isn't a good idea, it will just confuse users. We need to establish a strategy and come up with a good heuristic that can be applied at compile and/or runtime to automatically decide how to allocate buffers. -- Regards, Laurent Pinchart ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-30 15:35 ` Laurent Pinchart @ 2018-08-04 8:00 ` Matwey V. Kornilov 2018-08-04 14:46 ` Alan Stern 2018-08-08 22:32 ` Laurent Pinchart 0 siblings, 2 replies; 38+ messages in thread From: Matwey V. Kornilov @ 2018-08-04 8:00 UTC (permalink / raw) To: Laurent Pinchart Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-07-30 18:35 GMT+03:00 Laurent Pinchart <laurent.pinchart@ideasonboard.com>: > Hi Matwey, > > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: >> 2018-07-23 21:57 GMT+03:00 Alan Stern: >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: >> >> I've tried to strategies: >> >> >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) >> > >> > Yes. >> > >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only >> >> once at memory allocation) >> >> >> >> It is interesting that dma_unmap/dma_map pair leads to the lower >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. >> >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the >> >> handler, and sync_cpu/sync_device - ~65 usec. >> >> >> >> However, I am not sure is it mandatory to call >> >> dma_sync_single_for_device for FROM_DEVICE direction? >> > >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is >> > not needed. >> >> Well, I measured the following at armv7l. The handler execution time >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): >> >> 1) coherent DMA: ~3000 usec (pwc is not functional) >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 >> usec > > I really don't understand why the sync option is slower. Could you please > investigate ? Before doing anything we need to make sure we have a full > understanding of the problem. Hi, I've found one drawback in my measurements. I forgot to fix CPU frequency at lowest state 300MHz. Now, I remeasured 2) dma_unmap and dma_map in the handler: 2A) dma_unmap_single call: 28.8 +- 1.5 usec 2B) memcpy and the rest: 58 +- 6 usec 2C) dma_map_single call: 22 +- 2 usec Total: 110 +- 7 usec 3) dma_sync_single_for_cpu 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec 3B) memcpy and the rest: 59 +- 6 usec 3C) noop (trace events overhead): 5 +- 2 usec Total: 93 +- 7 usec So, now we see that 2A and 3A (as well as 2B and 3B) agree good within error ranges. > >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There >> is no performance improvement when dma_sync_single is used. >> >> At x86_64 the following happens: >> >> 1) coherent DMA: ~2 usec > > What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() ? > Could you trace it to see how memory is allocated exactly, and how it's mapped > to the CPU ? I suspect that it will end up in dma_direct_alloc() but I'd like > a confirmation. usb_alloc_coherents() ends up inside hcd_buffer_alloc() where dma_alloc_coherent() is called. Keep in mind, that requested size is 9560 in our case and pool is not used. > >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 usec >> >> So, whats to do next? Personally, I think that DMA streaming API >> introduces not so great overhead. > > It might not be very large, but with USB3 cameras at high resolutions and > framerates, it might still become noticeable. I wouldn't degrade performances > on x86, especially if we can decide which option to use based on the platform > (or perhaps even better based on Kconfig options such as DMA_NONCOHERENT). PWC is discontinued chip, so there will not be any new USB3 cameras. Kconfig won't work here, as I said before, DMA config is stored inside device tree blob on ARM architecture. > >> Does anybody happy with turning to streaming DMA or I'll introduce >> module-level switch as Ezequiel suggested? > > A module-level switch isn't a good idea, it will just confuse users. We need > to establish a strategy and come up with a good heuristic that can be applied > at compile and/or runtime to automatically decide how to allocate buffers. I am agree in general, but I cannot understand why webcam driver should think about memory allocation heuristics. > > -- > Regards, > > Laurent Pinchart > > > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-04 8:00 ` Matwey V. Kornilov @ 2018-08-04 14:46 ` Alan Stern 2018-08-05 7:49 ` Christoph Hellwig 2018-08-08 22:32 ` Laurent Pinchart 1 sibling, 1 reply; 38+ messages in thread From: Alan Stern @ 2018-08-04 14:46 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Laurent Pinchart, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Sat, 4 Aug 2018, Matwey V. Kornilov wrote: > 2018-07-30 18:35 GMT+03:00 Laurent Pinchart <laurent.pinchart@ideasonboard.com>: > > Hi Matwey, > > > > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: > >> 2018-07-23 21:57 GMT+03:00 Alan Stern: > >> > On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >> >> I've tried to strategies: > >> >> > >> >> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >> >> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > >> > > >> > Yes. > >> > > >> >> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > >> >> once at memory allocation) > >> >> > >> >> It is interesting that dma_unmap/dma_map pair leads to the lower > >> >> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > >> >> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > >> >> handler, and sync_cpu/sync_device - ~65 usec. > >> >> > >> >> However, I am not sure is it mandatory to call > >> >> dma_sync_single_for_device for FROM_DEVICE direction? > >> > > >> > According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > >> > to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > >> > not needed. > >> > >> Well, I measured the following at armv7l. The handler execution time > >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): > >> > >> 1) coherent DMA: ~3000 usec (pwc is not functional) > >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 > >> usec > > > > I really don't understand why the sync option is slower. Could you please > > investigate ? Before doing anything we need to make sure we have a full > > understanding of the problem. > > Hi, > > I've found one drawback in my measurements. I forgot to fix CPU > frequency at lowest state 300MHz. Now, I remeasured > > 2) dma_unmap and dma_map in the handler: > 2A) dma_unmap_single call: 28.8 +- 1.5 usec > 2B) memcpy and the rest: 58 +- 6 usec > 2C) dma_map_single call: 22 +- 2 usec > Total: 110 +- 7 usec > > 3) dma_sync_single_for_cpu > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec > 3B) memcpy and the rest: 59 +- 6 usec > 3C) noop (trace events overhead): 5 +- 2 usec > Total: 93 +- 7 usec > > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within > error ranges. Taken together, those measurements look like a pretty good argument for always using dma_sync_single_for_cpu in the driver. Provided results on other platforms aren't too far out of line with these results. Alan Stern ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-04 14:46 ` Alan Stern @ 2018-08-05 7:49 ` Christoph Hellwig 2018-08-05 8:33 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Christoph Hellwig @ 2018-08-05 7:49 UTC (permalink / raw) To: Alan Stern Cc: Matwey V. Kornilov, Laurent Pinchart, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Sat, Aug 04, 2018 at 10:46:35AM -0400, Alan Stern wrote: > > 2) dma_unmap and dma_map in the handler: > > 2A) dma_unmap_single call: 28.8 +- 1.5 usec > > 2B) memcpy and the rest: 58 +- 6 usec > > 2C) dma_map_single call: 22 +- 2 usec > > Total: 110 +- 7 usec > > > > 3) dma_sync_single_for_cpu > > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec > > 3B) memcpy and the rest: 59 +- 6 usec > > 3C) noop (trace events overhead): 5 +- 2 usec > > Total: 93 +- 7 usec > > > > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within > > error ranges. > > Taken together, those measurements look like a pretty good argument for > always using dma_sync_single_for_cpu in the driver. Provided results > on other platforms aren't too far out of line with these results. Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu and dma_unmap should always be identical. With the migration towards everyone using dma-direct and dma-noncoherent this is actually going to be enforced, and I plan to move that enforcement to common code in the next merge window or two. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-05 7:49 ` Christoph Hellwig @ 2018-08-05 8:33 ` Matwey V. Kornilov 2018-08-05 8:41 ` Christoph Hellwig 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-08-05 8:33 UTC (permalink / raw) To: Christoph Hellwig Cc: Alan Stern, Laurent Pinchart, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw 2018-08-05 10:49 GMT+03:00 Christoph Hellwig <hch@infradead.org>: > On Sat, Aug 04, 2018 at 10:46:35AM -0400, Alan Stern wrote: >> > 2) dma_unmap and dma_map in the handler: >> > 2A) dma_unmap_single call: 28.8 +- 1.5 usec >> > 2B) memcpy and the rest: 58 +- 6 usec >> > 2C) dma_map_single call: 22 +- 2 usec >> > Total: 110 +- 7 usec >> > >> > 3) dma_sync_single_for_cpu >> > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec >> > 3B) memcpy and the rest: 59 +- 6 usec >> > 3C) noop (trace events overhead): 5 +- 2 usec >> > Total: 93 +- 7 usec >> > >> > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within >> > error ranges. >> >> Taken together, those measurements look like a pretty good argument for >> always using dma_sync_single_for_cpu in the driver. Provided results >> on other platforms aren't too far out of line with these results. > > Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu > and dma_unmap should always be identical. With the migration towards > everyone using dma-direct and dma-noncoherent this is actually going to > be enforced, and I plan to move that enforcement to common code in the > next merge window or two. > I think that Alan means that using dma_sync_single_for_cpu() we save time required for subsequent dma_map() call (which is required when we do dma_unmap()). So, does everybody happy with dma_sync_single_for_cpu() for now? If so, then I'll prepare new version of the patch series. -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-05 8:33 ` Matwey V. Kornilov @ 2018-08-05 8:41 ` Christoph Hellwig 0 siblings, 0 replies; 38+ messages in thread From: Christoph Hellwig @ 2018-08-05 8:41 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Christoph Hellwig, Alan Stern, Laurent Pinchart, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Sun, Aug 05, 2018 at 11:33:38AM +0300, Matwey V. Kornilov wrote: > >> Taken together, those measurements look like a pretty good argument for > >> always using dma_sync_single_for_cpu in the driver. Provided results > >> on other platforms aren't too far out of line with these results. > > > > Logically speaking on no-mmio no-swiotlb platforms dma_sync_single_for_cpu > > and dma_unmap should always be identical. With the migration towards > > everyone using dma-direct and dma-noncoherent this is actually going to > > be enforced, and I plan to move that enforcement to common code in the > > next merge window or two. > > > > I think that Alan means that using dma_sync_single_for_cpu() we save > time required for subsequent dma_map() call (which is required when we > do dma_unmap()). The point still stands. By definition for a correct DMA API implementation a dma_sync_single_for_cpu/dma_sync_single_for_device pair is always going to be cheaper than a dma_unmap/dma_map pair, although for many cases the difference might be so small that it is not measureable. If you reuse a buffer using dma_sync_single* is always the right thing to do vs unmapping and remapping it. ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-04 8:00 ` Matwey V. Kornilov 2018-08-04 14:46 ` Alan Stern @ 2018-08-08 22:32 ` Laurent Pinchart 2018-08-09 2:36 ` Tomasz Figa 1 sibling, 1 reply; 38+ messages in thread From: Laurent Pinchart @ 2018-08-08 22:32 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Tomasz Figa, Ezequiel Garcia, Hans de Goede, Hans Verkuil, Mauro Carvalho Chehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw Hi Matwey, On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote: > 2018-07-30 18:35 GMT+03:00 Laurent Pinchart: > > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: > >> 2018-07-23 21:57 GMT+03:00 Alan Stern: > >>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >>>> I've tried to strategies: > >>>> > >>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >>>> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > >>> > >>> Yes. > >>> > >>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > >>>> once at memory allocation) > >>>> > >>>> It is interesting that dma_unmap/dma_map pair leads to the lower > >>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > >>>> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > >>>> handler, and sync_cpu/sync_device - ~65 usec. > >>>> > >>>> However, I am not sure is it mandatory to call > >>>> dma_sync_single_for_device for FROM_DEVICE direction? > >>> > >>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > >>> to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > >>> not needed. > >> > >> Well, I measured the following at armv7l. The handler execution time > >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): > >> > >> 1) coherent DMA: ~3000 usec (pwc is not functional) > >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 > >> usec > > > > I really don't understand why the sync option is slower. Could you please > > investigate ? Before doing anything we need to make sure we have a full > > understanding of the problem. > > Hi, > > I've found one drawback in my measurements. I forgot to fix CPU > frequency at lowest state 300MHz. Now, I remeasured > > 2) dma_unmap and dma_map in the handler: > 2A) dma_unmap_single call: 28.8 +- 1.5 usec > 2B) memcpy and the rest: 58 +- 6 usec > 2C) dma_map_single call: 22 +- 2 usec > Total: 110 +- 7 usec > > 3) dma_sync_single_for_cpu > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec > 3B) memcpy and the rest: 59 +- 6 usec > 3C) noop (trace events overhead): 5 +- 2 usec > Total: 93 +- 7 usec > > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within > error ranges. Thank you for the time you've spent on these measurements, the information is useful and your work very appreciated. > >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There > >> is no performance improvement when dma_sync_single is used. > >> > >> At x86_64 the following happens: > >> > >> 1) coherent DMA: ~2 usec > > > > What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() > > ? Could you trace it to see how memory is allocated exactly, and how it's > > mapped to the CPU ? I suspect that it will end up in dma_direct_alloc() > > but I'd like a confirmation. > > usb_alloc_coherents() ends up inside hcd_buffer_alloc() where > dma_alloc_coherent() is called. Keep in mind, that requested size is > 9560 in our case and pool is not used. > > >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 > >> usec > >> > >> So, whats to do next? Personally, I think that DMA streaming API > >> introduces not so great overhead. > > > > It might not be very large, but with USB3 cameras at high resolutions and > > framerates, it might still become noticeable. I wouldn't degrade > > performances on x86, especially if we can decide which option to use > > based on the platform (or perhaps even better based on Kconfig options > > such as DMA_NONCOHERENT). > > PWC is discontinued chip, so there will not be any new USB3 cameras. You're right. I had in mind other USB cameras that would benefit from the same change, and in particular the uvcvideo driver, which is used by USB3 cameras. > Kconfig won't work here, as I said before, DMA config is stored inside > device tree blob on ARM architecture. But couldn't we skip it at least on x86 ? > >> Does anybody happy with turning to streaming DMA or I'll introduce > >> module-level switch as Ezequiel suggested? > > > > A module-level switch isn't a good idea, it will just confuse users. We > > need to establish a strategy and come up with a good heuristic that can > > be applied at compile and/or runtime to automatically decide how to > > allocate buffers. > > I am agree in general, but I cannot understand why webcam driver > should think about memory allocation heuristics. I fully agree with you, this should be handled by either the USB core or the media core (possibly with a few static hints from the driver, such as buffer sizes, to help with heuristics, if needed at all). -- Regards, Laurent Pinchart ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-08 22:32 ` Laurent Pinchart @ 2018-08-09 2:36 ` Tomasz Figa 2018-08-09 10:28 ` Laurent Pinchart 0 siblings, 1 reply; 38+ messages in thread From: Tomasz Figa @ 2018-08-09 2:36 UTC (permalink / raw) To: Laurent Pinchart Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw On Thu, Aug 9, 2018 at 7:31 AM Laurent Pinchart <laurent.pinchart@ideasonboard.com> wrote: > > Hi Matwey, > > On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote: > > 2018-07-30 18:35 GMT+03:00 Laurent Pinchart: > > > On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: > > >> 2018-07-23 21:57 GMT+03:00 Alan Stern: > > >>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > > >>>> I've tried to strategies: > > >>>> > > >>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > > >>>> similar to how USB core does when there is no URB_NO_TRANSFER_DMA_MAP) > > >>> > > >>> Yes. > > >>> > > >>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map only > > >>>> once at memory allocation) > > >>>> > > >>>> It is interesting that dma_unmap/dma_map pair leads to the lower > > >>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 platform. > > >>>> At armv7l platform using dma_unmap/dma_map leads to ~50 usec in the > > >>>> handler, and sync_cpu/sync_device - ~65 usec. > > >>>> > > >>>> However, I am not sure is it mandatory to call > > >>>> dma_sync_single_for_device for FROM_DEVICE direction? > > >>> > > >>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not write > > >>> to a DMA_FROM_DEVICE-mapped area, so dma_sync_single_for_device() is > > >>> not needed. > > >> > > >> Well, I measured the following at armv7l. The handler execution time > > >> (URB_NO_TRANSFER_DMA_MAP is used for all cases): > > >> > > >> 1) coherent DMA: ~3000 usec (pwc is not functional) > > >> 2) explicit dma_unmap and dma_map in the handler: ~52 usec > > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~56 > > >> usec > > > > > > I really don't understand why the sync option is slower. Could you please > > > investigate ? Before doing anything we need to make sure we have a full > > > understanding of the problem. > > > > Hi, > > > > I've found one drawback in my measurements. I forgot to fix CPU > > frequency at lowest state 300MHz. Now, I remeasured > > > > 2) dma_unmap and dma_map in the handler: > > 2A) dma_unmap_single call: 28.8 +- 1.5 usec > > 2B) memcpy and the rest: 58 +- 6 usec > > 2C) dma_map_single call: 22 +- 2 usec > > Total: 110 +- 7 usec > > > > 3) dma_sync_single_for_cpu > > 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec > > 3B) memcpy and the rest: 59 +- 6 usec > > 3C) noop (trace events overhead): 5 +- 2 usec > > Total: 93 +- 7 usec > > > > So, now we see that 2A and 3A (as well as 2B and 3B) agree good within > > error ranges. > > Thank you for the time you've spent on these measurements, the information is > useful and your work very appreciated. > > > >> So, I suppose that unfortunately Tomasz suggestion doesn't work. There > > >> is no performance improvement when dma_sync_single is used. > > >> > > >> At x86_64 the following happens: > > >> > > >> 1) coherent DMA: ~2 usec > > > > > > What do you mean by coherent DMA for x86_64 ? Is that usb_alloc_coherent() > > > ? Could you trace it to see how memory is allocated exactly, and how it's > > > mapped to the CPU ? I suspect that it will end up in dma_direct_alloc() > > > but I'd like a confirmation. > > > > usb_alloc_coherents() ends up inside hcd_buffer_alloc() where > > dma_alloc_coherent() is called. Keep in mind, that requested size is > > 9560 in our case and pool is not used. > > > > >> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec > > >> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): ~4 > > >> usec > > >> > > >> So, whats to do next? Personally, I think that DMA streaming API > > >> introduces not so great overhead. > > > > > > It might not be very large, but with USB3 cameras at high resolutions and > > > framerates, it might still become noticeable. I wouldn't degrade > > > performances on x86, especially if we can decide which option to use > > > based on the platform (or perhaps even better based on Kconfig options > > > such as DMA_NONCOHERENT). > > > > PWC is discontinued chip, so there will not be any new USB3 cameras. > > You're right. I had in mind other USB cameras that would benefit from the same > change, and in particular the uvcvideo driver, which is used by USB3 cameras. > > > Kconfig won't work here, as I said before, DMA config is stored inside > > device tree blob on ARM architecture. > > But couldn't we skip it at least on x86 ? If we use the map-once, sync-repeatedly approach, would there be anything to gain on x86? I believe the sync ops there would be effectively no-ops, so the only overhead would be of a function call. Best regards, Tomasz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-08-09 2:36 ` Tomasz Figa @ 2018-08-09 10:28 ` Laurent Pinchart 0 siblings, 0 replies; 38+ messages in thread From: Laurent Pinchart @ 2018-08-09 10:28 UTC (permalink / raw) To: Tomasz Figa Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw Hi Tomasz, On Thursday, 9 August 2018 05:36:46 EEST Tomasz Figa wrote: > On Thu, Aug 9, 2018 at 7:31 AM Laurent Pinchart wrote: > > On Saturday, 4 August 2018 11:00:05 EEST Matwey V. Kornilov wrote: > >> 2018-07-30 18:35 GMT+03:00 Laurent Pinchart: > >>> On Tuesday, 24 July 2018 21:56:09 EEST Matwey V. Kornilov wrote: > >>>> 2018-07-23 21:57 GMT+03:00 Alan Stern: > >>>>> On Mon, 23 Jul 2018, Matwey V. Kornilov wrote: > >>>>>> I've tried to strategies: > >>>>>> > >>>>>> 1) Use dma_unmap and dma_map inside the handler (I suppose this is > >>>>>> similar to how USB core does when there is no > >>>>>> URB_NO_TRANSFER_DMA_MAP) > >>>>> > >>>>> Yes. > >>>>> > >>>>>> 2) Use sync_cpu and sync_device inside the handler (and dma_map > >>>>>> only once at memory allocation) > >>>>>> > >>>>>> It is interesting that dma_unmap/dma_map pair leads to the lower > >>>>>> overhead (+1us) than sync_cpu/sync_device (+2us) at x86_64 > >>>>>> platform. At armv7l platform using dma_unmap/dma_map leads to ~50 > >>>>>> usec in the handler, and sync_cpu/sync_device - ~65 usec. > >>>>>> > >>>>>> However, I am not sure is it mandatory to call > >>>>>> dma_sync_single_for_device for FROM_DEVICE direction? > >>>>> > >>>>> According to Documentation/DMA-API-HOWTO.txt, the CPU should not > >>>>> write to a DMA_FROM_DEVICE-mapped area, so > >>>>> dma_sync_single_for_device() is not needed. > >>>> > >>>> Well, I measured the following at armv7l. The handler execution time > >>>> (URB_NO_TRANSFER_DMA_MAP is used for all cases): > >>>> > >>>> 1) coherent DMA: ~3000 usec (pwc is not functional) > >>>> 2) explicit dma_unmap and dma_map in the handler: ~52 usec > >>>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): > >>>> ~56 usec > >>> > >>> I really don't understand why the sync option is slower. Could you > >>> please investigate ? Before doing anything we need to make sure we have > >>> a full understanding of the problem. > >> > >> Hi, > >> > >> I've found one drawback in my measurements. I forgot to fix CPU > >> frequency at lowest state 300MHz. Now, I remeasured > >> > >> 2) dma_unmap and dma_map in the handler: > >> 2A) dma_unmap_single call: 28.8 +- 1.5 usec > >> 2B) memcpy and the rest: 58 +- 6 usec > >> 2C) dma_map_single call: 22 +- 2 usec > >> Total: 110 +- 7 usec > >> > >> 3) dma_sync_single_for_cpu > >> 3A) dma_sync_single_for_cpu call: 29.4 +- 1.7 usec > >> 3B) memcpy and the rest: 59 +- 6 usec > >> 3C) noop (trace events overhead): 5 +- 2 usec > >> Total: 93 +- 7 usec > >> > >> So, now we see that 2A and 3A (as well as 2B and 3B) agree good within > >> error ranges. > > > > Thank you for the time you've spent on these measurements, the information > > is useful and your work very appreciated. > > > >>>> So, I suppose that unfortunately Tomasz suggestion doesn't work. > >>>> There is no performance improvement when dma_sync_single is used. > >>>> > >>>> At x86_64 the following happens: > >>>> > >>>> 1) coherent DMA: ~2 usec > >>> > >>> What do you mean by coherent DMA for x86_64 ? Is that > >>> usb_alloc_coherent() ? Could you trace it to see how memory is allocated > >>> exactly, and how it's mapped to the CPU ? I suspect that it will end up > >>> in dma_direct_alloc() but I'd like a confirmation. > >> > >> usb_alloc_coherents() ends up inside hcd_buffer_alloc() where > >> dma_alloc_coherent() is called. Keep in mind, that requested size is > >> 9560 in our case and pool is not used. > >> > >>>> 2) explicit dma_unmap and dma_map in the handler: ~3.5 usec > >>>> 3) explicit dma_sync_single_for_cpu (no dma_sync_single_for_device): > >>>> ~4 usec > >>>> > >>>> So, whats to do next? Personally, I think that DMA streaming API > >>>> introduces not so great overhead. > >>> > >>> It might not be very large, but with USB3 cameras at high resolutions > >>> and framerates, it might still become noticeable. I wouldn't degrade > >>> performances on x86, especially if we can decide which option to use > >>> based on the platform (or perhaps even better based on Kconfig options > >>> such as DMA_NONCOHERENT). > >> > >> PWC is discontinued chip, so there will not be any new USB3 cameras. > > > > You're right. I had in mind other USB cameras that would benefit from the > > same change, and in particular the uvcvideo driver, which is used by USB3 > > cameras. > > > >> Kconfig won't work here, as I said before, DMA config is stored inside > >> device tree blob on ARM architecture. > > > > But couldn't we skip it at least on x86 ? > > If we use the map-once, sync-repeatedly approach, would there be > anything to gain on x86? I believe the sync ops there would be > effectively no-ops, so the only overhead would be of a function call. With that approach, and iff they're effectively no-ops, that should be fine. We thus need to double-check. -- Regards, Laurent Pinchart ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-20 11:33 ` Tomasz Figa 2018-07-20 11:57 ` Matwey V. Kornilov @ 2018-07-30 15:13 ` Laurent Pinchart 1 sibling, 0 replies; 38+ messages in thread From: Laurent Pinchart @ 2018-07-30 15:13 UTC (permalink / raw) To: Tomasz Figa Cc: Matwey V. Kornilov, Alan Stern, Ezequiel Garcia, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List, Kieran Bingham, keiichiw Hi Tomasz, On Friday, 20 July 2018 14:33:33 EEST Tomasz Figa wrote: > On Fri, Jul 20, 2018 at 8:23 PM Matwey V. Kornilov wrote: > > 2018-07-20 13:55 GMT+03:00 Tomasz Figa: > >> On Wed, Jul 18, 2018 at 5:51 AM Alan Stern wrote: > >>> On Tue, 17 Jul 2018, Ezequiel Garcia wrote: > >>>> Hi Matwey, > >>>> > >>>> First of all, sorry for the delay. > >>>> > >>>> Adding Alan and Hans. Guys, do you have any feedback here? > >>> > >>> ... > >>> > >>>>>> So, what is the benefit of using consistent > >>>>>> for these URBs, as opposed to streaming? > >>>>> > >>>>> I don't know, I think there is no real benefit and all we see is a > >>>>> consequence of copy-pasta when some webcam drivers were inspired by > >>>>> others and development priparily was going at x86 platforms. > >>>> > >>>> You are probably right about the copy-pasta. > >>>> > >>>>> It would be great if somebody corrected me here. DMA Coherence is > >>>>> quite strong property and I cannot figure out how can it help when > >>>>> streaming video. The CPU host always reads from the buffer and never > >>>>> writes to. Hardware perepherial always writes to and never reads from. > >>>>> Moreover, buffer access is mutually exclusive and separated in time by > >>>>> Interrupt fireing and URB starting (when we reuse existing URB for new > >>>>> request). Only single one memory barrier is really required here. > >>>> > >>>> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > >>>> create DMA mappings and use the streaming API. Which makes more > >>>> sense in hardware without hardware coherency. > >>> > >>> As far as I know, the _only_ advantage to using coherent DMA in this > >>> situation is that you then do not have to pay the overhead of > >>> constantly setting up and tearing down the streaming mappings. So it > >>> depends very much on the platform: If coherent buffers are cached then > >>> it's a slight win and otherwise it's a big lose. > >> > >> Isn't it about usb_alloc_coherent() being backed by DMA coherent API > >> (dma_alloc_coherent/attrs()) and ending up allocating uncached (or > >> write-combine) memory for devices with non-coherent DMAs? I'm not sure > > > > Yes, this is what exactly happens at armv7l platforms. > > Okay, thanks. So this seems to be exactly the same thing that is > happening in the UVC driver. There is quite a bit of random accesses > to extract some header fields and then a big memcpy into VB2 buffer to > assemble final frame. > > If we don't want to pay the cost of creating and destroying the > streaming mapping, we could map (dma_map_single()) once, set > transfer_dma of URB and URB_NO_TRANSFER_DMA_MAP and then just > synchronize the caches (dma_sync_single()) before submitting/after > completing the URB. The problem is that the dma_sync_single() calls can end up being quite costly, depending on the platform, its cache architecture, and the buffer size. For a given system and use case we should always be able to decide which option is best, but finding that dynamically at runtime isn't an easy task. I remember that when developing the OMAP3 ISP driver at Nokia we had a heuristics that forced a full D-cache clean above a certain picture resolution as that was faster than selectively cleaning cache lines. Furthermore, the DMA mapping API doesn't help us here, as it doesn't allow a platform to optimize operations based on the buffer mappings. An easy example is that there's no way for the DMA mapping implementation on ARM to find out in the dma_sync_single() operation that the buffer isn't mapped to the CPU and that the CPU cache clean can be skipped. This doesn't affect USB drivers as a CPU mapping always exists, but there could be some limitations there too when we'll try to optimize the implementation. -- Regards, Laurent Pinchart ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-17 20:10 ` Ezequiel Garcia 2018-07-17 20:51 ` Alan Stern @ 2018-07-18 12:10 ` Matwey V. Kornilov 2018-07-19 23:36 ` Ezequiel Garcia 2018-07-30 16:07 ` Mauro Carvalho Chehab 2 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-18 12:10 UTC (permalink / raw) To: Ezequiel Garcia Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > Hi Matwey, > > First of all, sorry for the delay. > > Adding Alan and Hans. Guys, do you have any feedback here? > > See below for some feedback on my side. > > On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote: >> Hi Ezequiel, >> >> 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: >> > + Laurent >> > >> > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: >> > > DMA cocherency slows the transfer down on systems without hardware >> > > coherent DMA. >> > > >> > > Based on previous commit the following performance benchmarks have been >> > > carried out. Average memcpy() data transfer rate (rate) and handler >> > > completion time (time) have been measured when running video stream at >> > > 640x480 resolution at 10fps. >> > > >> > > x86_64 based system (Intel Core i5-3470). This platform has hardware >> > > coherent DMA support and proposed change doesn't make big difference here. >> > > >> > > * kmalloc: rate = (4.4 +- 1.0) GBps >> > > time = (2.4 +- 1.2) usec >> > > * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps >> > > time = (2.5 +- 1.0) usec >> > > >> > > We see that the measurements agree well within error ranges in this case. >> > > So no performance downgrade is introduced. >> > > >> > > armv7l based system (TI AM335x BeagleBone Black). This platform has no >> > > hardware coherent DMA support. DMA coherence is implemented via disabled >> > > page caching that slows down memcpy() due to memory controller behaviour. >> > > >> > > * kmalloc: rate = (190 +- 30) MBps >> > > time = (50 +- 10) usec >> > > * usb_alloc_coherent: rate = (33 +- 4) MBps >> > > time = (3000 +- 400) usec >> > > >> > > Note, that quantative difference leads (this commit leads to 5 times >> > > acceleration) to qualitative behavior change in this case. As it was >> > > stated before, the video stream can not be successfully received at AM335x >> > > platforms with MUSB based USB host controller due to performance issues >> > > [1]. >> > > >> > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html >> > > >> > >> > This is quite interesting! I have receive similar complaints >> > from users wanting to use stk1160 on BBB and Raspberrys, >> > without much luck on either, due to insufficient isoc bandwidth. >> > >> > I'm guessing other ARM platforms could be suffering >> > from the same issue. >> > >> > Note that stk1160 and uvcvideo drivers use kmalloc on platforms >> > where DMA_NONCOHERENT is defined, but this is not the case >> > on ARM platforms. >> >> There are some ARMv7 platforms that have coherent DMA (for instance >> Broadcome Horthstar Plus series), but the most of them don't have. It >> is defined in device tree file, and there is no way to recover this >> information at runtime in USB perepherial driver. >> >> > >> > So, what is the benefit of using consistent >> > for these URBs, as opposed to streaming? >> >> I don't know, I think there is no real benefit and all we see is a >> consequence of copy-pasta when some webcam drivers were inspired by >> others and development priparily was going at x86 platforms. > > You are probably right about the copy-pasta. > >> It would >> be great if somebody corrected me here. DMA Coherence is quite strong >> property and I cannot figure out how can it help when streaming video. >> The CPU host always reads from the buffer and never writes to. >> Hardware perepherial always writes to and never reads from. Moreover, >> buffer access is mutually exclusive and separated in time by Interrupt >> fireing and URB starting (when we reuse existing URB for new request). >> Only single one memory barrier is really required here. >> > > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > create DMA mappings and use the streaming API. Which makes more > sense in hardware without hardware coherency. > > The only thing that bothers me with this patch is that it's not > really something specific to this driver. If this fix is valid > for pwc, then it's valid for all the drivers allocating coherent > memory. > > And also, this path won't prevent further copy-paste spread > of the coherent allocation. > > Is there any chance we can introduce a helper to allocate > isoc URBs, and then change all drivers to use it? No need > to do all of them now, but it would be good to at least have > a plan for it. Well, basically I am agree with you. However, I don't have all possible hardware to test, so I can't fix all possible drivers. Also I can not figure out how could the helper looked like. What do you think about usb_alloc() (c.f. usb_alloc_coherent()) ? > > Chances are, only a handful of platforms would care to use > coherent mappings. > >> I understand that there are cases when DMA coherence is really needed, >> for instane VirtIO VRing when we accessing same data structure in both >> directions from the both sides, but this has nothing common with our >> case. >> >> > >> > If the choice is simply platform dependent, >> > can't we somehow detect which mapping should >> > be prefered? >> >> Now, we don't have this way. >> >> > >> > > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> >> > > --- >> > > drivers/media/usb/pwc/pwc-if.c | 12 +++--------- >> > > 1 file changed, 3 insertions(+), 9 deletions(-) >> > > >> > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c >> > > index 5775d1f60668..6a3cd9680a7f 100644 >> > > --- a/drivers/media/usb/pwc/pwc-if.c >> > > +++ b/drivers/media/usb/pwc/pwc-if.c >> > > @@ -427,11 +427,8 @@ static int pwc_isoc_init(struct pwc_device *pdev) >> > > urb->interval = 1; // devik >> > > urb->dev = udev; >> > > urb->pipe = usb_rcvisocpipe(udev, pdev->vendpoint); >> > > - urb->transfer_flags = URB_ISO_ASAP | URB_NO_TRANSFER_DMA_MAP; >> > > - urb->transfer_buffer = usb_alloc_coherent(udev, >> > > - ISO_BUFFER_SIZE, >> > > - GFP_KERNEL, >> > > - &urb->transfer_dma); >> > > + urb->transfer_flags = URB_ISO_ASAP; >> > > + urb->transfer_buffer = kmalloc(ISO_BUFFER_SIZE, GFP_KERNEL); >> > > if (urb->transfer_buffer == NULL) { >> > > PWC_ERROR("Failed to allocate urb buffer %d\n", i); >> > > pwc_isoc_cleanup(pdev); >> > > @@ -491,10 +488,7 @@ static void pwc_iso_free(struct pwc_device *pdev) >> > > if (pdev->urbs[i]) { >> > > PWC_DEBUG_MEMORY("Freeing URB\n"); >> > > if (pdev->urbs[i]->transfer_buffer) { >> > > - usb_free_coherent(pdev->udev, >> > > - pdev->urbs[i]->transfer_buffer_length, >> > > - pdev->urbs[i]->transfer_buffer, >> > > - pdev->urbs[i]->transfer_dma); >> > > + kfree(pdev->urbs[i]->transfer_buffer); >> > > } >> > > usb_free_urb(pdev->urbs[i]); >> > > pdev->urbs[i] = NULL; >> >> >> > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-18 12:10 ` Matwey V. Kornilov @ 2018-07-19 23:36 ` Ezequiel Garcia 2018-07-20 9:35 ` Matwey V. Kornilov 2018-07-30 15:42 ` Laurent Pinchart 0 siblings, 2 replies; 38+ messages in thread From: Ezequiel Garcia @ 2018-07-19 23:36 UTC (permalink / raw) To: Matwey V. Kornilov Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote: > 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > > Hi Matwey, > > > > First of all, sorry for the delay. > > > > Adding Alan and Hans. Guys, do you have any feedback here? > > > > See below for some feedback on my side. > > > > On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote: > > > Hi Ezequiel, > > > > > > 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > > > > + Laurent > > > > > > > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: > > > > > DMA cocherency slows the transfer down on systems without hardware > > > > > coherent DMA. > > > > > > > > > > Based on previous commit the following performance benchmarks have been > > > > > carried out. Average memcpy() data transfer rate (rate) and handler > > > > > completion time (time) have been measured when running video stream at > > > > > 640x480 resolution at 10fps. > > > > > > > > > > x86_64 based system (Intel Core i5-3470). This platform has hardware > > > > > coherent DMA support and proposed change doesn't make big difference here. > > > > > > > > > > * kmalloc: rate = (4.4 +- 1.0) GBps > > > > > time = (2.4 +- 1.2) usec > > > > > * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps > > > > > time = (2.5 +- 1.0) usec > > > > > > > > > > We see that the measurements agree well within error ranges in this case. > > > > > So no performance downgrade is introduced. > > > > > > > > > > armv7l based system (TI AM335x BeagleBone Black). This platform has no > > > > > hardware coherent DMA support. DMA coherence is implemented via disabled > > > > > page caching that slows down memcpy() due to memory controller behaviour. > > > > > > > > > > * kmalloc: rate = (190 +- 30) MBps > > > > > time = (50 +- 10) usec > > > > > * usb_alloc_coherent: rate = (33 +- 4) MBps > > > > > time = (3000 +- 400) usec > > > > > > > > > > Note, that quantative difference leads (this commit leads to 5 times > > > > > acceleration) to qualitative behavior change in this case. As it was > > > > > stated before, the video stream can not be successfully received at AM335x > > > > > platforms with MUSB based USB host controller due to performance issues > > > > > [1]. > > > > > > > > > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html > > > > > > > > > > > > > This is quite interesting! I have receive similar complaints > > > > from users wanting to use stk1160 on BBB and Raspberrys, > > > > without much luck on either, due to insufficient isoc bandwidth. > > > > > > > > I'm guessing other ARM platforms could be suffering > > > > from the same issue. > > > > > > > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms > > > > where DMA_NONCOHERENT is defined, but this is not the case > > > > on ARM platforms. > > > > > > There are some ARMv7 platforms that have coherent DMA (for instance > > > Broadcome Horthstar Plus series), but the most of them don't have. It > > > is defined in device tree file, and there is no way to recover this > > > information at runtime in USB perepherial driver. > > > > > > > > > > > So, what is the benefit of using consistent > > > > for these URBs, as opposed to streaming? > > > > > > I don't know, I think there is no real benefit and all we see is a > > > consequence of copy-pasta when some webcam drivers were inspired by > > > others and development priparily was going at x86 platforms. > > > > You are probably right about the copy-pasta. > > > > > It would > > > be great if somebody corrected me here. DMA Coherence is quite strong > > > property and I cannot figure out how can it help when streaming video. > > > The CPU host always reads from the buffer and never writes to. > > > Hardware perepherial always writes to and never reads from. Moreover, > > > buffer access is mutually exclusive and separated in time by Interrupt > > > fireing and URB starting (when we reuse existing URB for new request). > > > Only single one memory barrier is really required here. > > > > > > > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > > create DMA mappings and use the streaming API. Which makes more > > sense in hardware without hardware coherency. > > > > The only thing that bothers me with this patch is that it's not > > really something specific to this driver. If this fix is valid > > for pwc, then it's valid for all the drivers allocating coherent > > memory. > > > > And also, this path won't prevent further copy-paste spread > > of the coherent allocation. > > > > Is there any chance we can introduce a helper to allocate > > isoc URBs, and then change all drivers to use it? No need > > to do all of them now, but it would be good to at least have > > a plan for it. > > Well, basically I am agree with you. > However, I don't have all possible hardware to test, so I can't fix > all possible drivers. Sure. And keep in mind this is more about the USB host controller, than about this specific driver. So it's the controller what we would have to test! > Also I can not figure out how could the helper looked like. What do > you think about usb_alloc() (c.f. usb_alloc_coherent()) ? > I do not know that either. But it's something we can think about. Meanwhile, it would be a shame to loose or stall this excellent effort (which is effectively enabling a cameras on a bunch of devices). How about you introduce a driver parameter (or device attribute), to avoid changing the behavior for USB host controllers we don't know about? Something like 'alloc_coherent_urbs=y/n'. Perhaps set that to 'yes' by default in x86, and 'no' by default in the rest? We can think about a generic solution later. Thanks, Eze ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-19 23:36 ` Ezequiel Garcia @ 2018-07-20 9:35 ` Matwey V. Kornilov 2018-07-30 15:42 ` Laurent Pinchart 1 sibling, 0 replies; 38+ messages in thread From: Matwey V. Kornilov @ 2018-07-20 9:35 UTC (permalink / raw) To: Ezequiel Garcia Cc: Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list 2018-07-20 2:36 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote: >> 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: >> > Hi Matwey, >> > >> > First of all, sorry for the delay. >> > >> > Adding Alan and Hans. Guys, do you have any feedback here? >> > >> > See below for some feedback on my side. >> > >> > On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote: >> > > Hi Ezequiel, >> > > >> > > 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: >> > > > + Laurent >> > > > >> > > > On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: >> > > > > DMA cocherency slows the transfer down on systems without hardware >> > > > > coherent DMA. >> > > > > >> > > > > Based on previous commit the following performance benchmarks have been >> > > > > carried out. Average memcpy() data transfer rate (rate) and handler >> > > > > completion time (time) have been measured when running video stream at >> > > > > 640x480 resolution at 10fps. >> > > > > >> > > > > x86_64 based system (Intel Core i5-3470). This platform has hardware >> > > > > coherent DMA support and proposed change doesn't make big difference here. >> > > > > >> > > > > * kmalloc: rate = (4.4 +- 1.0) GBps >> > > > > time = (2.4 +- 1.2) usec >> > > > > * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps >> > > > > time = (2.5 +- 1.0) usec >> > > > > >> > > > > We see that the measurements agree well within error ranges in this case. >> > > > > So no performance downgrade is introduced. >> > > > > >> > > > > armv7l based system (TI AM335x BeagleBone Black). This platform has no >> > > > > hardware coherent DMA support. DMA coherence is implemented via disabled >> > > > > page caching that slows down memcpy() due to memory controller behaviour. >> > > > > >> > > > > * kmalloc: rate = (190 +- 30) MBps >> > > > > time = (50 +- 10) usec >> > > > > * usb_alloc_coherent: rate = (33 +- 4) MBps >> > > > > time = (3000 +- 400) usec >> > > > > >> > > > > Note, that quantative difference leads (this commit leads to 5 times >> > > > > acceleration) to qualitative behavior change in this case. As it was >> > > > > stated before, the video stream can not be successfully received at AM335x >> > > > > platforms with MUSB based USB host controller due to performance issues >> > > > > [1]. >> > > > > >> > > > > [1] https://www.spinics.net/lists/linux-usb/msg165735.html >> > > > > >> > > > >> > > > This is quite interesting! I have receive similar complaints >> > > > from users wanting to use stk1160 on BBB and Raspberrys, >> > > > without much luck on either, due to insufficient isoc bandwidth. >> > > > >> > > > I'm guessing other ARM platforms could be suffering >> > > > from the same issue. >> > > > >> > > > Note that stk1160 and uvcvideo drivers use kmalloc on platforms >> > > > where DMA_NONCOHERENT is defined, but this is not the case >> > > > on ARM platforms. >> > > >> > > There are some ARMv7 platforms that have coherent DMA (for instance >> > > Broadcome Horthstar Plus series), but the most of them don't have. It >> > > is defined in device tree file, and there is no way to recover this >> > > information at runtime in USB perepherial driver. >> > > >> > > > >> > > > So, what is the benefit of using consistent >> > > > for these URBs, as opposed to streaming? >> > > >> > > I don't know, I think there is no real benefit and all we see is a >> > > consequence of copy-pasta when some webcam drivers were inspired by >> > > others and development priparily was going at x86 platforms. >> > >> > You are probably right about the copy-pasta. >> > >> > > It would >> > > be great if somebody corrected me here. DMA Coherence is quite strong >> > > property and I cannot figure out how can it help when streaming video. >> > > The CPU host always reads from the buffer and never writes to. >> > > Hardware perepherial always writes to and never reads from. Moreover, >> > > buffer access is mutually exclusive and separated in time by Interrupt >> > > fireing and URB starting (when we reuse existing URB for new request). >> > > Only single one memory barrier is really required here. >> > > >> > >> > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core >> > create DMA mappings and use the streaming API. Which makes more >> > sense in hardware without hardware coherency. >> > >> > The only thing that bothers me with this patch is that it's not >> > really something specific to this driver. If this fix is valid >> > for pwc, then it's valid for all the drivers allocating coherent >> > memory. >> > >> > And also, this path won't prevent further copy-paste spread >> > of the coherent allocation. >> > >> > Is there any chance we can introduce a helper to allocate >> > isoc URBs, and then change all drivers to use it? No need >> > to do all of them now, but it would be good to at least have >> > a plan for it. >> >> Well, basically I am agree with you. >> However, I don't have all possible hardware to test, so I can't fix >> all possible drivers. > > Sure. And keep in mind this is more about the USB host controller, > than about this specific driver. So it's the controller what we > would have to test! > >> Also I can not figure out how could the helper looked like. What do >> you think about usb_alloc() (c.f. usb_alloc_coherent()) ? >> > > I do not know that either. But it's something we can think about. > > Meanwhile, it would be a shame to loose or stall this excellent > effort (which is effectively enabling a cameras on a bunch of devices). > > How about you introduce a driver parameter (or device attribute), > to avoid changing the behavior for USB host controllers we don't know > about? It would be fine for me, if there are no objections. I know that some maintainers don't like module parameters per se. > > Something like 'alloc_coherent_urbs=y/n'. Perhaps set that > to 'yes' by default in x86, and 'no' by default in the rest? > > We can think about a generic solution later. > > Thanks, > Eze > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-19 23:36 ` Ezequiel Garcia 2018-07-20 9:35 ` Matwey V. Kornilov @ 2018-07-30 15:42 ` Laurent Pinchart 1 sibling, 0 replies; 38+ messages in thread From: Laurent Pinchart @ 2018-07-30 15:42 UTC (permalink / raw) To: Ezequiel Garcia Cc: Matwey V. Kornilov, Alan Stern, Hans de Goede, hverkuil, mchehab, Steven Rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list Hi Ezequiel, On Friday, 20 July 2018 02:36:40 EEST Ezequiel Garcia wrote: > On Wed, 2018-07-18 at 15:10 +0300, Matwey V. Kornilov wrote: > > 2018-07-17 23:10 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > >> On Mon, 2018-06-18 at 10:10 +0300, Matwey V. Kornilov wrote: > >>> 2018-06-18 8:11 GMT+03:00 Ezequiel Garcia <ezequiel@collabora.com>: > >>>> On Sun, 2018-06-17 at 17:36 +0300, Matwey V. Kornilov wrote: > >>>>> DMA cocherency slows the transfer down on systems without hardware > >>>>> coherent DMA. > >>>>> > >>>>> Based on previous commit the following performance benchmarks have > >>>>> been carried out. Average memcpy() data transfer rate (rate) and > >>>>> handler completion time (time) have been measured when running > >>>>> video stream at 640x480 resolution at 10fps. > >>>>> > >>>>> x86_64 based system (Intel Core i5-3470). This platform has > >>>>> hardware coherent DMA support and proposed change doesn't make big > >>>>> difference here. > >>>>> > >>>>> * kmalloc: rate = (4.4 +- 1.0) GBps > >>>>> time = (2.4 +- 1.2) usec > >>>>> > >>>>> * usb_alloc_coherent: rate = (4.1 +- 0.9) GBps > >>>>> time = (2.5 +- 1.0) usec > >>>>> > >>>>> We see that the measurements agree well within error ranges in > >>>>> this case. So no performance downgrade is introduced. > >>>>> > >>>>> armv7l based system (TI AM335x BeagleBone Black). This platform > >>>>> has no hardware coherent DMA support. DMA coherence is implemented > >>>>> via disabled page caching that slows down memcpy() due to memory > >>>>> controller behaviour. > >>>>> > >>>>> * kmalloc: rate = (190 +- 30) MBps > >>>>> time = (50 +- 10) usec > >>>>> > >>>>> * usb_alloc_coherent: rate = (33 +- 4) MBps > >>>>> time = (3000 +- 400) usec > >>>>> > >>>>> Note, that quantative difference leads (this commit leads to 5 > >>>>> times acceleration) to qualitative behavior change in this case. > >>>>> As it was stated before, the video stream can not be successfully > >>>>> received at AM335x platforms with MUSB based USB host controller > >>>>> due to performance issues [1]. > >>>>> > >>>>> [1] https://www.spinics.net/lists/linux-usb/msg165735.html > >>>> > >>>> This is quite interesting! I have receive similar complaints > >>>> from users wanting to use stk1160 on BBB and Raspberrys, > >>>> without much luck on either, due to insufficient isoc bandwidth. > >>>> > >>>> I'm guessing other ARM platforms could be suffering > >>>> from the same issue. > >>>> > >>>> Note that stk1160 and uvcvideo drivers use kmalloc on platforms > >>>> where DMA_NONCOHERENT is defined, but this is not the case > >>>> on ARM platforms. > >>> > >>> There are some ARMv7 platforms that have coherent DMA (for instance > >>> Broadcome Horthstar Plus series), but the most of them don't have. It > >>> is defined in device tree file, and there is no way to recover this > >>> information at runtime in USB perepherial driver. > >>> > >>>> So, what is the benefit of using consistent > >>>> for these URBs, as opposed to streaming? > >>> > >>> I don't know, I think there is no real benefit and all we see is a > >>> consequence of copy-pasta when some webcam drivers were inspired by > >>> others and development priparily was going at x86 platforms. > >> > >> You are probably right about the copy-pasta. > >> > >>> It would be great if somebody corrected me here. DMA Coherence is quite > >>> strong property and I cannot figure out how can it help when streaming > >>> video. The CPU host always reads from the buffer and never writes to. > >>> Hardware perepherial always writes to and never reads from. Moreover, > >>> buffer access is mutually exclusive and separated in time by Interrupt > >>> fireing and URB starting (when we reuse existing URB for new request). > >>> Only single one memory barrier is really required here. > >> > >> Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > >> create DMA mappings and use the streaming API. Which makes more > >> sense in hardware without hardware coherency. > >> > >> The only thing that bothers me with this patch is that it's not > >> really something specific to this driver. If this fix is valid > >> for pwc, then it's valid for all the drivers allocating coherent > >> memory. > >> > >> And also, this path won't prevent further copy-paste spread > >> of the coherent allocation. > >> > >> Is there any chance we can introduce a helper to allocate > >> isoc URBs, and then change all drivers to use it? No need > >> to do all of them now, but it would be good to at least have > >> a plan for it. > > > > Well, basically I am agree with you. > > However, I don't have all possible hardware to test, so I can't fix > > all possible drivers. > > Sure. And keep in mind this is more about the USB host controller, > than about this specific driver. So it's the controller what we > would have to test! > > > Also I can not figure out how could the helper looked like. What do > > you think about usb_alloc() (c.f. usb_alloc_coherent()) ? > > I do not know that either. But it's something we can think about. > > Meanwhile, it would be a shame to loose or stall this excellent > effort (which is effectively enabling a cameras on a bunch of devices). > > How about you introduce a driver parameter (or device attribute), > to avoid changing the behavior for USB host controllers we don't know > about? > > Something like 'alloc_coherent_urbs=y/n'. Perhaps set that > to 'yes' by default in x86, and 'no' by default in the rest? > > We can think about a generic solution later. A generic solution would be much better though. Could we still try to achieve one, and only go for a hack as a last resort ? With an analysis of code flows and performances on x86 vs. ARM I don't think it would be too difficult to decide what to do. -- Regards, Laurent Pinchart ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-17 20:10 ` Ezequiel Garcia 2018-07-17 20:51 ` Alan Stern 2018-07-18 12:10 ` Matwey V. Kornilov @ 2018-07-30 16:07 ` Mauro Carvalho Chehab 2018-07-31 6:06 ` Tomasz Figa 2 siblings, 1 reply; 38+ messages in thread From: Mauro Carvalho Chehab @ 2018-07-30 16:07 UTC (permalink / raw) To: Ezequiel Garcia Cc: Matwey V. Kornilov, Alan Stern, Hans de Goede, hverkuil, mchehab, Laurent Pinchart, rostedt, mingo, isely, bhumirks, colin.king, linux-media, open list Em Tue, 17 Jul 2018 17:10:22 -0300 Ezequiel Garcia <ezequiel@collabora.com> escreveu: > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > create DMA mappings and use the streaming API. Which makes more > sense in hardware without hardware coherency. > > The only thing that bothers me with this patch is that it's not > really something specific to this driver. If this fix is valid > for pwc, then it's valid for all the drivers allocating coherent > memory. We're actually doing this change on other drivers: https://git.linuxtv.org/media_tree.git/commit/?id=d571b592c6206 I suspect that the reason why all USB media drivers were using URB_NO_TRANSFER_DMA_MAP is just because the first media USB driver upstream used it. On that time, I remember I tried once to not use this flag, but there was something that broke (perhaps I just didn't know enough about the USB layer - or perhaps some fixes happened at USB core - allowing it to be used with ISOC transfers). Anyway, nowadays, I fail to see a reason why not let the USB core do the DMA maps. On my tests after this patch, at the boards I tested (arm and x86), I was unable to see any regressions. Thanks, Mauro ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer 2018-07-30 16:07 ` Mauro Carvalho Chehab @ 2018-07-31 6:06 ` Tomasz Figa 0 siblings, 0 replies; 38+ messages in thread From: Tomasz Figa @ 2018-07-31 6:06 UTC (permalink / raw) To: Mauro Carvalho Chehab Cc: Ezequiel Garcia, Matwey V. Kornilov, Alan Stern, hdegoede, Hans Verkuil, Mauro Carvalho Chehab, Laurent Pinchart, rostedt, mingo, Mike Isely, Bhumika Goyal, Colin King, Linux Media Mailing List, Linux Kernel Mailing List On Tue, Jul 31, 2018 at 1:07 AM Mauro Carvalho Chehab <mchehab+samsung@kernel.org> wrote: > > Em Tue, 17 Jul 2018 17:10:22 -0300 > Ezequiel Garcia <ezequiel@collabora.com> escreveu: > > > Yeah, and not setting URB_NO_TRANSFER_DMA_MAP makes the USB core > > create DMA mappings and use the streaming API. Which makes more > > sense in hardware without hardware coherency. > > > > The only thing that bothers me with this patch is that it's not > > really something specific to this driver. If this fix is valid > > for pwc, then it's valid for all the drivers allocating coherent > > memory. > > We're actually doing this change on other drivers: > https://git.linuxtv.org/media_tree.git/commit/?id=d571b592c6206 > > I suspect that the reason why all USB media drivers were using > URB_NO_TRANSFER_DMA_MAP is just because the first media USB driver > upstream used it. > > On that time, I remember I tried once to not use this flag, but there > was something that broke (perhaps I just didn't know enough about the > USB layer - or perhaps some fixes happened at USB core - allowing it > to be used with ISOC transfers). > > Anyway, nowadays, I fail to see a reason why not let the USB core > do the DMA maps. On my tests after this patch, at the boards I tested > (arm and x86), I was unable to see any regressions. I can see one reason: usb_alloc_coherent() would use dma_pool_alloc() with a fallback to dma_alloc_coherent() to do the allocation. I'm not sure what a typical size for an URB buffer is, but I assume that it's definitely more than 1 page. Order >0 allocations with page allocator (and SLAB, which eventually just falls back to page allocator for such large allocations) are generally considered costly. With allocation through DMA API, mechanisms such as CMA or IOMMU can be used (if available), making it much more likely to have the allocation satisfied on heavy load / long uptime systems. Best regards, Tomasz ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() 2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov 2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov @ 2018-06-18 18:58 ` Steven Rostedt 2018-06-19 16:23 ` Matwey V. Kornilov 1 sibling, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2018-06-18 18:58 UTC (permalink / raw) To: Matwey V. Kornilov Cc: hverkuil, mchehab, mingo, isely, bhumirks, colin.king, linux-media, linux-kernel On Sun, 17 Jun 2018 17:36:24 +0300 "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote: I would prefer a change log here that would explain why these tracepoints are being added. > Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> > --- > drivers/media/usb/pwc/pwc-if.c | 7 +++++++ > include/trace/events/pwc.h | 45 ++++++++++++++++++++++++++++++++++++++++++ > 2 files changed, 52 insertions(+) > create mode 100644 include/trace/events/pwc.h > > diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c > index 54b036d39c5b..5775d1f60668 100644 > --- a/drivers/media/usb/pwc/pwc-if.c > +++ b/drivers/media/usb/pwc/pwc-if.c > @@ -57,6 +57,9 @@ > - Pham Thanh Nam: webcam snapshot button as an event input device > */ > > +#define CREATE_TRACE_POINTS > +#include <trace/events/pwc.h> > + > #include <linux/errno.h> > #include <linux/init.h> > #include <linux/mm.h> > @@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb) > int i, fst, flen; > unsigned char *iso_buf = NULL; > > + trace_pwc_handler_enter(urb); > + > if (urb->status == -ENOENT || urb->status == -ECONNRESET || > urb->status == -ESHUTDOWN) { > PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n", Looks like if this is hit, we will return from the function without calling trace_pwc_handler_exit(). > @@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb) > pdev->vlast_packet_size = flen; > } > > + trace_pwc_handler_exit(urb); > + > handler_end: Why not add the tracepoint after handler_end. In fact, why not add some exit status to the trace event? I would think that would be useful as well. > i = usb_submit_urb(urb, GFP_ATOMIC); > if (i != 0) > diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h > new file mode 100644 > index 000000000000..b13d2118bb7a > --- /dev/null > +++ b/include/trace/events/pwc.h > @@ -0,0 +1,45 @@ > +/* SPDX-License-Identifier: GPL-2.0 */ > +#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ) > +#define _TRACE_PWC_H > + > +#include <linux/usb.h> > +#include <linux/tracepoint.h> > + > +#undef TRACE_SYSTEM > +#define TRACE_SYSTEM pwc > + > +TRACE_EVENT(pwc_handler_enter, > + TP_PROTO(struct urb *urb), > + TP_ARGS(urb), > + TP_STRUCT__entry( > + __field(struct urb*, urb) > + __field(int, urb__status) > + __field(u32, urb__actual_length) > + ), > + TP_fast_assign( > + __entry->urb = urb; > + __entry->urb__status = urb->status; > + __entry->urb__actual_length = urb->actual_length; Is there any other data that may be interesting to record here? -- Steve > + ), > + TP_printk("urb=%p (status=%d actual_length=%u)", > + __entry->urb, > + __entry->urb__status, > + __entry->urb__actual_length) > +); > + > +TRACE_EVENT(pwc_handler_exit, > + TP_PROTO(struct urb *urb), > + TP_ARGS(urb), > + TP_STRUCT__entry( > + __field(struct urb*, urb) > + ), > + TP_fast_assign( > + __entry->urb = urb; > + ), > + TP_printk("urb=%p", __entry->urb) > +); > + > +#endif /* _TRACE_PWC_H */ > + > +/* This part must be outside protection */ > +#include <trace/define_trace.h> ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() 2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt @ 2018-06-19 16:23 ` Matwey V. Kornilov 2018-06-19 16:33 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-06-19 16:23 UTC (permalink / raw) To: Steven Rostedt Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list Hi Steven, Thank you for valuable comments. This is for measuring performance of URB completion callback inside PWC driver. What do you think about moving events to __usb_hcd_giveback_urb() in order to make this more generic? Like the following: local_irq_save(flags); // trace urb complete enter urb->complete(urb); // trace urb complete exit local_irq_restore(flags); 2018-06-18 21:58 GMT+03:00 Steven Rostedt <rostedt@goodmis.org>: > On Sun, 17 Jun 2018 17:36:24 +0300 > "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote: > > I would prefer a change log here that would explain why these > tracepoints are being added. > > >> Signed-off-by: Matwey V. Kornilov <matwey@sai.msu.ru> >> --- >> drivers/media/usb/pwc/pwc-if.c | 7 +++++++ >> include/trace/events/pwc.h | 45 ++++++++++++++++++++++++++++++++++++++++++ >> 2 files changed, 52 insertions(+) >> create mode 100644 include/trace/events/pwc.h >> >> diff --git a/drivers/media/usb/pwc/pwc-if.c b/drivers/media/usb/pwc/pwc-if.c >> index 54b036d39c5b..5775d1f60668 100644 >> --- a/drivers/media/usb/pwc/pwc-if.c >> +++ b/drivers/media/usb/pwc/pwc-if.c >> @@ -57,6 +57,9 @@ >> - Pham Thanh Nam: webcam snapshot button as an event input device >> */ >> >> +#define CREATE_TRACE_POINTS >> +#include <trace/events/pwc.h> >> + >> #include <linux/errno.h> >> #include <linux/init.h> >> #include <linux/mm.h> >> @@ -260,6 +263,8 @@ static void pwc_isoc_handler(struct urb *urb) >> int i, fst, flen; >> unsigned char *iso_buf = NULL; >> >> + trace_pwc_handler_enter(urb); >> + >> if (urb->status == -ENOENT || urb->status == -ECONNRESET || >> urb->status == -ESHUTDOWN) { >> PWC_DEBUG_OPEN("URB (%p) unlinked %ssynchronously.\n", > > Looks like if this is hit, we will return from the function without > calling trace_pwc_handler_exit(). > >> @@ -347,6 +352,8 @@ static void pwc_isoc_handler(struct urb *urb) >> pdev->vlast_packet_size = flen; >> } >> >> + trace_pwc_handler_exit(urb); >> + >> handler_end: > > Why not add the tracepoint after handler_end. In fact, why not add some > exit status to the trace event? I would think that would be useful as > well. > > >> i = usb_submit_urb(urb, GFP_ATOMIC); >> if (i != 0) >> diff --git a/include/trace/events/pwc.h b/include/trace/events/pwc.h >> new file mode 100644 >> index 000000000000..b13d2118bb7a >> --- /dev/null >> +++ b/include/trace/events/pwc.h >> @@ -0,0 +1,45 @@ >> +/* SPDX-License-Identifier: GPL-2.0 */ >> +#if !defined(_TRACE_PWC_H) || defined(TRACE_HEADER_MULTI_READ) >> +#define _TRACE_PWC_H >> + >> +#include <linux/usb.h> >> +#include <linux/tracepoint.h> >> + >> +#undef TRACE_SYSTEM >> +#define TRACE_SYSTEM pwc >> + >> +TRACE_EVENT(pwc_handler_enter, >> + TP_PROTO(struct urb *urb), >> + TP_ARGS(urb), >> + TP_STRUCT__entry( >> + __field(struct urb*, urb) >> + __field(int, urb__status) >> + __field(u32, urb__actual_length) >> + ), >> + TP_fast_assign( >> + __entry->urb = urb; >> + __entry->urb__status = urb->status; >> + __entry->urb__actual_length = urb->actual_length; > > Is there any other data that may be interesting to record here? > > -- Steve > >> + ), >> + TP_printk("urb=%p (status=%d actual_length=%u)", >> + __entry->urb, >> + __entry->urb__status, >> + __entry->urb__actual_length) >> +); >> + >> +TRACE_EVENT(pwc_handler_exit, >> + TP_PROTO(struct urb *urb), >> + TP_ARGS(urb), >> + TP_STRUCT__entry( >> + __field(struct urb*, urb) >> + ), >> + TP_fast_assign( >> + __entry->urb = urb; >> + ), >> + TP_printk("urb=%p", __entry->urb) >> +); >> + >> +#endif /* _TRACE_PWC_H */ >> + >> +/* This part must be outside protection */ >> +#include <trace/define_trace.h> > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() 2018-06-19 16:23 ` Matwey V. Kornilov @ 2018-06-19 16:33 ` Steven Rostedt 2018-06-20 8:05 ` Matwey V. Kornilov 0 siblings, 1 reply; 38+ messages in thread From: Steven Rostedt @ 2018-06-19 16:33 UTC (permalink / raw) To: Matwey V. Kornilov Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list On Tue, 19 Jun 2018 19:23:04 +0300 "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote: > Hi Steven, > > Thank you for valuable comments. > > This is for measuring performance of URB completion callback inside PWC driver. > What do you think about moving events to __usb_hcd_giveback_urb() in > order to make this more generic? Like the following: > > local_irq_save(flags); > // trace urb complete enter > urb->complete(urb); > // trace urb complete exit > local_irq_restore(flags); > > If that can work for you, I'm fine with that. Trace events may be cheap, but they do come with some cost. I'd like to have all trace events be as valuable as possible, and limit the "special case" ones. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() 2018-06-19 16:33 ` Steven Rostedt @ 2018-06-20 8:05 ` Matwey V. Kornilov 2018-06-20 13:09 ` Steven Rostedt 0 siblings, 1 reply; 38+ messages in thread From: Matwey V. Kornilov @ 2018-06-20 8:05 UTC (permalink / raw) To: Steven Rostedt Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list 2018-06-19 19:33 GMT+03:00 Steven Rostedt <rostedt@goodmis.org>: > On Tue, 19 Jun 2018 19:23:04 +0300 > "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote: > >> Hi Steven, >> >> Thank you for valuable comments. >> >> This is for measuring performance of URB completion callback inside PWC driver. >> What do you think about moving events to __usb_hcd_giveback_urb() in >> order to make this more generic? Like the following: >> >> local_irq_save(flags); >> // trace urb complete enter >> urb->complete(urb); >> // trace urb complete exit >> local_irq_restore(flags); >> >> > > If that can work for you, I'm fine with that. Trace events may be > cheap, but they do come with some cost. I'd like to have all trace > events be as valuable as possible, and limit the "special case" ones. What is the cost for events? I suppose one conditional check when trace is disabled? There is already similar debugging stuff related to usbmon in __usb_hcd_giveback_urb(), so I don't think that another conditional check will hurt performance dramatically there. When discussing second patch in this series I see that the issue that it is intended to resolve may be common to other USB media drivers. > > -- Steve > -- With best regards, Matwey V. Kornilov. Sternberg Astronomical Institute, Lomonosov Moscow State University, Russia 119234, Moscow, Universitetsky pr-k 13, +7 (495) 9392382 ^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() 2018-06-20 8:05 ` Matwey V. Kornilov @ 2018-06-20 13:09 ` Steven Rostedt 0 siblings, 0 replies; 38+ messages in thread From: Steven Rostedt @ 2018-06-20 13:09 UTC (permalink / raw) To: Matwey V. Kornilov Cc: hverkuil, mchehab, mingo, Mike Isely, Bhumika Goyal, Colin King, linux-media, open list On Wed, 20 Jun 2018 11:05:51 +0300 "Matwey V. Kornilov" <matwey@sai.msu.ru> wrote: > > If that can work for you, I'm fine with that. Trace events may be > > cheap, but they do come with some cost. I'd like to have all trace > > events be as valuable as possible, and limit the "special case" ones. > > What is the cost for events? I suppose one conditional check when > trace is disabled? There is already similar debugging stuff related to > usbmon in __usb_hcd_giveback_urb(), so I don't think that another > conditional check will hurt performance dramatically there. When > discussing second patch in this series I see that the issue that it is > intended to resolve may be common to other USB media drivers. The cost isn't just about performance. In fact, the performance overhead of trace events is pretty negligible. The cost I'm worried about is bloat. Each event can take up to 5K of memory. That can add up quickly when we add a bunch of events without thinking about that cost. -- Steve ^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2018-08-09 10:27 UTC | newest] Thread overview: 38+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-06-17 14:36 [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Matwey V. Kornilov 2018-06-17 14:36 ` [PATCH 2/2] media: usb: pwc: Don't use coherent DMA buffers for ISO transfer Matwey V. Kornilov 2018-06-18 5:11 ` Ezequiel Garcia 2018-06-18 7:10 ` Matwey V. Kornilov 2018-07-17 20:10 ` Ezequiel Garcia 2018-07-17 20:51 ` Alan Stern 2018-07-20 10:55 ` Tomasz Figa 2018-07-20 11:22 ` Matwey V. Kornilov 2018-07-20 11:33 ` Tomasz Figa 2018-07-20 11:57 ` Matwey V. Kornilov 2018-07-23 17:04 ` Matwey V. Kornilov 2018-07-23 18:57 ` Alan Stern 2018-07-24 18:56 ` Matwey V. Kornilov 2018-07-24 20:55 ` Alan Stern 2018-07-25 13:46 ` Matwey V. Kornilov 2018-07-30 4:14 ` Tomasz Figa 2018-08-04 8:05 ` Matwey V. Kornilov 2018-07-30 15:35 ` Laurent Pinchart 2018-08-04 8:00 ` Matwey V. Kornilov 2018-08-04 14:46 ` Alan Stern 2018-08-05 7:49 ` Christoph Hellwig 2018-08-05 8:33 ` Matwey V. Kornilov 2018-08-05 8:41 ` Christoph Hellwig 2018-08-08 22:32 ` Laurent Pinchart 2018-08-09 2:36 ` Tomasz Figa 2018-08-09 10:28 ` Laurent Pinchart 2018-07-30 15:13 ` Laurent Pinchart 2018-07-18 12:10 ` Matwey V. Kornilov 2018-07-19 23:36 ` Ezequiel Garcia 2018-07-20 9:35 ` Matwey V. Kornilov 2018-07-30 15:42 ` Laurent Pinchart 2018-07-30 16:07 ` Mauro Carvalho Chehab 2018-07-31 6:06 ` Tomasz Figa 2018-06-18 18:58 ` [PATCH 1/2] Add TRACE_EVENTs in pwc_isoc_handler() Steven Rostedt 2018-06-19 16:23 ` Matwey V. Kornilov 2018-06-19 16:33 ` Steven Rostedt 2018-06-20 8:05 ` Matwey V. Kornilov 2018-06-20 13:09 ` Steven Rostedt
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.