* CONFIG_DMA_CMA causes ttm performance problems/hangs.
@ 2014-08-08 17:42 ` Mario Kleiner
  0 siblings, 0 replies; 30+ messages in thread
From: Mario Kleiner @ 2014-08-08 17:42 UTC (permalink / raw)
  To: dri-devel
  Cc: Ben Skeggs, Alex Deucher, Christian König, Thomas Hellstrom,
	m.szyprowski, LKML, kamal, ben, Mario Kleiner

Hi all,

there is a rather severe performance problem I accidentally found when 
trying to give Linux 3.16.0 a final test on an x86_64 MacBook Pro under 
Ubuntu 14.04 LTS with nouveau as the graphics driver.

I was lazy and just installed the Ubuntu precompiled mainline kernel. 
That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA 
(contiguous memory allocator) size of 64 MB. Older Ubuntu kernels 
weren't compiled with CMA, so I only observed this on 3.16, but 
previous kernels built with CMA would likely be affected too.

After a few minutes of regular desktop use like switching workspaces, 
scrolling text in a terminal window, Firefox with multiple tabs open, 
Thunderbird etc. (tested with KDE/KWin, with and without desktop 
composition), I get chunky desktop updates, then multi-second freezes; 
after a few minutes the desktop hangs for over a minute on almost any 
GUI action like switching windows etc. --> Unusable.

ftrace'ing shows the culprit being this callchain (typical good/bad 
example ftrace snippets at the end of this mail):

...ttm dma coherent memory allocations, e.g., from 
__ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform 
specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] --> 
dma_alloc_from_contiguous()

dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or when 
the machine is booted with kernel boot cmdline parameter "cma=0", so it 
triggers the fast alloc_pages_node() fallback at least on x86_64.

With CMA, this function becomes progressively slower with every 
minute of desktop use, e.g., runtimes going up from < 0.3 usecs to 
hundreds or thousands of microseconds (before it gives up and the 
alloc_pages_node() fallback is used), so this causes the 
multi-second/minute hangs of the desktop.

So it seems ttm memory allocations quickly fragment and/or exhaust the 
CMA memory area, and dma_alloc_from_contiguous() tries very hard to find 
a fitting hole big enough to satisfy allocations with a retry loop (see 
http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339) 
that takes forever.

This is not good, nor is it good for other devices which actually need 
a non-fragmented CMA area for DMA, so what to do? I doubt most current 
GPUs still need physically contiguous DMA memory, maybe with the 
exception of some embedded GPUs?

My naive approach would be to add a new gfp_t flag a la ___GFP_AVOIDCMA, 
have callers of dma_alloc_from_contiguous() skip the CMA path if they 
have some fallback for getting memory, and then add that flag to ttm's 
ttm_dma_populate() gfp_flags, e.g., around here: 
http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884

However, I'm not familiar enough with memory management, so greater 
minds here likely have much better ideas on how to deal with this?

thanks,
-mario

Typical snippet from an example trace of a badly stalling desktop with 
CMA (the alloc_pages_node() fallback may have been missing from this 
trace's ftrace_filter settings):

  1)               |                          ttm_dma_pool_get_pages [ttm]() {
  1)               | ttm_dma_page_pool_fill_locked [ttm]() {
  1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
  1)               | __ttm_dma_alloc_page [ttm]() {
  1)               | dma_generic_alloc_coherent() {
  1) ! 1873.071 us | dma_alloc_from_contiguous();
  1) ! 1874.292 us |                                  }
  1) ! 1875.400 us |                                }
  1)               | __ttm_dma_alloc_page [ttm]() {
  1)               | dma_generic_alloc_coherent() {
  1) ! 1868.372 us | dma_alloc_from_contiguous();
  1) ! 1869.586 us |                                  }
  1) ! 1870.053 us |                                }
  1)               | __ttm_dma_alloc_page [ttm]() {
  1)               | dma_generic_alloc_coherent() {
  1) ! 1871.085 us | dma_alloc_from_contiguous();
  1) ! 1872.240 us |                                  }
  1) ! 1872.669 us |                                }
  1)               | __ttm_dma_alloc_page [ttm]() {
  1)               | dma_generic_alloc_coherent() {
  1) ! 1888.934 us | dma_alloc_from_contiguous();
  1) ! 1890.179 us |                                  }
  1) ! 1890.608 us |                                }
  1)   0.048 us    | ttm_set_pages_caching [ttm]();
  1) ! 7511.000 us |                              }
  1) ! 7511.306 us |                            }
  1) ! 7511.623 us |                          }

The good case (with cma=0 on the kernel cmdline, so 
dma_alloc_from_contiguous() is a no-op):

  0)               |                          ttm_dma_pool_get_pages [ttm]() {
  0)               | ttm_dma_page_pool_fill_locked [ttm]() {
  0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
  0)               | __ttm_dma_alloc_page [ttm]() {
  0)               | dma_generic_alloc_coherent() {
  0)   0.171 us    | dma_alloc_from_contiguous();
  0)   0.849 us    | __alloc_pages_nodemask();
  0)   3.029 us    |                                  }
  0)   3.882 us    |                                }
  0)               | __ttm_dma_alloc_page [ttm]() {
  0)               | dma_generic_alloc_coherent() {
  0)   0.037 us    | dma_alloc_from_contiguous();
  0)   0.163 us    | __alloc_pages_nodemask();
  0)   1.408 us    |                                  }
  0)   1.719 us    |                                }
  0)               | __ttm_dma_alloc_page [ttm]() {
  0)               | dma_generic_alloc_coherent() {
  0)   0.035 us    | dma_alloc_from_contiguous();
  0)   0.153 us    | __alloc_pages_nodemask();
  0)   1.454 us    |                                  }
  0)   1.720 us    |                                }
  0)               | __ttm_dma_alloc_page [ttm]() {
  0)               | dma_generic_alloc_coherent() {
  0)   0.036 us    | dma_alloc_from_contiguous();
  0)   0.112 us    | __alloc_pages_nodemask();
  0)   1.211 us    |                                  }
  0)   1.541 us    |                                }
  0)   0.035 us    | ttm_set_pages_caching [ttm]();
  0) + 10.902 us   |                              }
  0) + 11.577 us   |                            }
  0) + 11.988 us   |                          }


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-08 17:42 ` Mario Kleiner
@ 2014-08-09  5:39   ` Thomas Hellstrom
  -1 siblings, 0 replies; 30+ messages in thread
From: Thomas Hellstrom @ 2014-08-09  5:39 UTC (permalink / raw)
  To: Mario Kleiner, Konrad Rzeszutek Wilk
  Cc: dri-devel, Thomas Hellstrom, kamal, LKML, ben, m.szyprowski

Hi.

IIRC I don't think the TTM DMA pool allocates coherent pages more than
one page at a time, and _if that's true_ it's pretty unnecessary for the
dma subsystem to route those allocations to CMA. Maybe Konrad could shed
some light over this?

/Thomas


On 08/08/2014 07:42 PM, Mario Kleiner wrote:
> Hi all,
>
> there is a rather severe performance problem i accidentally found when
> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
> Ubuntu 14.04 LTS with nouveau as graphics driver.
>
> I was lazy and just installed the Ubuntu precompiled mainline kernel.
> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
> weren't compiled with CMA, so i only observed this on 3.16, but
> previous kernels would likely be affected too.
>
> After a few minutes of regular desktop use like switching workspaces,
> scrolling text in a terminal window, Firefox with multiple tabs open,
> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
> composition), i get chunky desktop updates, then multi-second freezes,
> after a few minutes the desktop hangs for over a minute on almost any
> GUI action like switching windows etc. --> Unuseable.
>
> ftrace'ing shows the culprit being this callchain (typical good/bad
> example ftrace snippets at the end of this mail):
>
> ...ttm dma coherent memory allocations, e.g., from
> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
> dma_alloc_from_contiguous()
>
> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or when
> the machine is booted with kernel boot cmdline parameter "cma=0", so
> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>
> With CMA, this function becomes progressively more slow with every
> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
> hundreds or thousands of microseconds (before it gives up and
> alloc_pages_node() fallback is used), so this causes the
> multi-second/minute hangs of the desktop.
>
> So it seems ttm memory allocations quickly fragment and/or exhaust the
> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
> find a fitting hole big enough to satisfy allocations with a retry
> loop (see
> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
> that takes forever.
>
> This is not good, also not for other devices which actually need a
> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
> still need physically contiguous dma memory, maybe with exception of
> some embedded gpus?
>
> My naive approach would be to add a new gfp_t flag a la
> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
> refrain from doing so if they have some fallback for getting memory.
> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
> around here:
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>
> However i'm not familiar enough with memory management, so likely
> greater minds here have much better ideas on how to deal with this?
>
> thanks,
> -mario
>
> Typical snippet from an example trace of a badly stalling desktop with
> CMA (alloc_pages_node() fallback may have been missing in this traces
> ftrace_filter settings):
>
> 1)               |                          ttm_dma_pool_get_pages
> [ttm]() {
>  1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>  1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>  1)               | __ttm_dma_alloc_page [ttm]() {
>  1)               | dma_generic_alloc_coherent() {
>  1) ! 1873.071 us | dma_alloc_from_contiguous();
>  1) ! 1874.292 us |                                  }
>  1) ! 1875.400 us |                                }
>  1)               | __ttm_dma_alloc_page [ttm]() {
>  1)               | dma_generic_alloc_coherent() {
>  1) ! 1868.372 us | dma_alloc_from_contiguous();
>  1) ! 1869.586 us |                                  }
>  1) ! 1870.053 us |                                }
>  1)               | __ttm_dma_alloc_page [ttm]() {
>  1)               | dma_generic_alloc_coherent() {
>  1) ! 1871.085 us | dma_alloc_from_contiguous();
>  1) ! 1872.240 us |                                  }
>  1) ! 1872.669 us |                                }
>  1)               | __ttm_dma_alloc_page [ttm]() {
>  1)               | dma_generic_alloc_coherent() {
>  1) ! 1888.934 us | dma_alloc_from_contiguous();
>  1) ! 1890.179 us |                                  }
>  1) ! 1890.608 us |                                }
>  1)   0.048 us    | ttm_set_pages_caching [ttm]();
>  1) ! 7511.000 us |                              }
>  1) ! 7511.306 us |                            }
>  1) ! 7511.623 us |                          }
>
> The good case (with cma=0 kernel cmdline, so
> dma_alloc_from_contiguous() no-ops,)
>
> 0)               |                          ttm_dma_pool_get_pages
> [ttm]() {
>  0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>  0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>  0)               | __ttm_dma_alloc_page [ttm]() {
>  0)               | dma_generic_alloc_coherent() {
>  0)   0.171 us    | dma_alloc_from_contiguous();
>  0)   0.849 us    | __alloc_pages_nodemask();
>  0)   3.029 us    |                                  }
>  0)   3.882 us    |                                }
>  0)               | __ttm_dma_alloc_page [ttm]() {
>  0)               | dma_generic_alloc_coherent() {
>  0)   0.037 us    | dma_alloc_from_contiguous();
>  0)   0.163 us    | __alloc_pages_nodemask();
>  0)   1.408 us    |                                  }
>  0)   1.719 us    |                                }
>  0)               | __ttm_dma_alloc_page [ttm]() {
>  0)               | dma_generic_alloc_coherent() {
>  0)   0.035 us    | dma_alloc_from_contiguous();
>  0)   0.153 us    | __alloc_pages_nodemask();
>  0)   1.454 us    |                                  }
>  0)   1.720 us    |                                }
>  0)               | __ttm_dma_alloc_page [ttm]() {
>  0)               | dma_generic_alloc_coherent() {
>  0)   0.036 us    | dma_alloc_from_contiguous();
>  0)   0.112 us    | __alloc_pages_nodemask();
>  0)   1.211 us    |                                  }
>  0)   1.541 us    |                                }
>  0)   0.035 us    | ttm_set_pages_caching [ttm]();
>  0) + 10.902 us   |                              }
>  0) + 11.577 us   |                            }
>  0) + 11.988 us   |                          }
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel



* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-09  5:39   ` Thomas Hellstrom
@ 2014-08-09 13:33     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 30+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-09 13:33 UTC (permalink / raw)
  To: Thomas Hellstrom, Mario Kleiner; +Cc: dri-devel, kamal, LKML, ben, m.szyprowski

On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>Hi.
>
Hey Thomas!

>IIRC I don't think the TTM DMA pool allocates coherent pages more than
>one page at a time, and _if that's true_ it's pretty unnecessary for
>the
>dma subsystem to route those allocations to CMA. Maybe Konrad could
>shed
>some light over this?

It should allocate in batches and keep them in the TTM DMA pool for some time to be reused.

The pages that it gets are in 4kb granularity though.
>
>/Thomas
>
>
>On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>> Hi all,
>>
>> there is a rather severe performance problem i accidentally found
>when
>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>
>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>> weren't compiled with CMA, so i only observed this on 3.16, but
>> previous kernels would likely be affected too.
>>
>> After a few minutes of regular desktop use like switching workspaces,
>> scrolling text in a terminal window, Firefox with multiple tabs open,
>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>> composition), i get chunky desktop updates, then multi-second
>freezes,
>> after a few minutes the desktop hangs for over a minute on almost any
>> GUI action like switching windows etc. --> Unuseable.
>>
>> ftrace'ing shows the culprit being this callchain (typical good/bad
>> example ftrace snippets at the end of this mail):
>>
>> ...ttm dma coherent memory allocations, e.g., from
>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>> dma_alloc_from_contiguous()
>>
>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>when
>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>
>> With CMA, this function becomes progressively more slow with every
>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>> hundreds or thousands of microseconds (before it gives up and
>> alloc_pages_node() fallback is used), so this causes the
>> multi-second/minute hangs of the desktop.
>>
>> So it seems ttm memory allocations quickly fragment and/or exhaust
>the
>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>> find a fitting hole big enough to satisfy allocations with a retry
>> loop (see
>>
>http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>> that takes forever.

I am curious why it does not end up using the pool - as in, using the 
TTM DMA pool to pick pages instead of allocating (and freeing) new ones?

>>
>> This is not good, also not for other devices which actually need a
>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>> still need physically contiguous dma memory, maybe with exception of
>> some embedded gpus?

Oh. If I understood you correctly, CMA ends up handing out huge chunks 
of contiguous memory. But if the sizes are 4kb, I wonder why it would 
do that?

The modern GPUs on x86 can deal with scatter-gather and, as you 
surmise, don't need physically contiguous areas.
>>
>> My naive approach would be to add a new gfp_t flag a la
>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>> refrain from doing so if they have some fallback for getting memory.
>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>> around here:
>>
>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>
>> However i'm not familiar enough with memory management, so likely
>> greater minds here have much better ideas on how to deal with this?
>>

That is a bit of a hack to deal with CMA being slow.

Hmm. Let's first figure out why the TTM DMA pool is not reusing pages.
>> thanks,
>> -mario
>>
>> Typical snippet from an example trace of a badly stalling desktop
>with
>> CMA (alloc_pages_node() fallback may have been missing in this traces
>> ftrace_filter settings):
>>
>> 1)               |                          ttm_dma_pool_get_pages
>> [ttm]() {
>>  1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>  1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>  1)               | dma_generic_alloc_coherent() {
>>  1) ! 1873.071 us | dma_alloc_from_contiguous();
>>  1) ! 1874.292 us |                                  }
>>  1) ! 1875.400 us |                                }
>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>  1)               | dma_generic_alloc_coherent() {
>>  1) ! 1868.372 us | dma_alloc_from_contiguous();
>>  1) ! 1869.586 us |                                  }
>>  1) ! 1870.053 us |                                }
>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>  1)               | dma_generic_alloc_coherent() {
>>  1) ! 1871.085 us | dma_alloc_from_contiguous();
>>  1) ! 1872.240 us |                                  }
>>  1) ! 1872.669 us |                                }
>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>  1)               | dma_generic_alloc_coherent() {
>>  1) ! 1888.934 us | dma_alloc_from_contiguous();
>>  1) ! 1890.179 us |                                  }
>>  1) ! 1890.608 us |                                }
>>  1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>  1) ! 7511.000 us |                              }
>>  1) ! 7511.306 us |                            }
>>  1) ! 7511.623 us |                          }
>>
>> The good case (with cma=0 kernel cmdline, so
>> dma_alloc_from_contiguous() no-ops,)
>>
>> 0)               |                          ttm_dma_pool_get_pages
>> [ttm]() {
>>  0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>  0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>  0)               | dma_generic_alloc_coherent() {
>>  0)   0.171 us    | dma_alloc_from_contiguous();
>>  0)   0.849 us    | __alloc_pages_nodemask();
>>  0)   3.029 us    |                                  }
>>  0)   3.882 us    |                                }
>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>  0)               | dma_generic_alloc_coherent() {
>>  0)   0.037 us    | dma_alloc_from_contiguous();
>>  0)   0.163 us    | __alloc_pages_nodemask();
>>  0)   1.408 us    |                                  }
>>  0)   1.719 us    |                                }
>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>  0)               | dma_generic_alloc_coherent() {
>>  0)   0.035 us    | dma_alloc_from_contiguous();
>>  0)   0.153 us    | __alloc_pages_nodemask();
>>  0)   1.454 us    |                                  }
>>  0)   1.720 us    |                                }
>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>  0)               | dma_generic_alloc_coherent() {
>>  0)   0.036 us    | dma_alloc_from_contiguous();
>>  0)   0.112 us    | __alloc_pages_nodemask();
>>  0)   1.211 us    |                                  }
>>  0)   1.541 us    |                                }
>>  0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>  0) + 10.902 us   |                              }
>>  0) + 11.577 us   |                            }
>>  0) + 11.988 us   |                          }
>>
>> _______________________________________________
>> dri-devel mailing list
>> dri-devel@lists.freedesktop.org
>> http://lists.freedesktop.org/mailman/listinfo/dri-devel



^ permalink raw reply	[flat|nested] 30+ messages in thread


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-09 13:33     ` Konrad Rzeszutek Wilk
@ 2014-08-09 13:58       ` Thomas Hellstrom
  -1 siblings, 0 replies; 30+ messages in thread
From: Thomas Hellstrom @ 2014-08-09 13:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: Mario Kleiner, dri-devel, kamal, LKML, ben, m.szyprowski



On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>> Hi.
>>
> Hey Thomas!
>
>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>> one page at a time, and _if that's true_ it's pretty unnecessary for
>> the
>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>> shed
>> some light over this?
> It should allocate in batches and keep them in the TTM DMA pool for some time to be reused.
>
> The pages that it gets are in 4kb granularity though.

Then I feel inclined to say this is a DMA subsystem bug. Single page
allocations shouldn't get routed to CMA.

/Thomas


>> /Thomas
>>
>>
>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>> Hi all,
>>>
>>> there is a rather severe performance problem i accidentally found
>> when
>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>
>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>> previous kernels would likely be affected too.
>>>
>>> After a few minutes of regular desktop use like switching workspaces,
>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>> composition), i get chunky desktop updates, then multi-second
>> freezes,
>>> after a few minutes the desktop hangs for over a minute on almost any
>>> GUI action like switching windows etc. --> Unuseable.
>>>
>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>> example ftrace snippets at the end of this mail):
>>>
>>> ...ttm dma coherent memory allocations, e.g., from
>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>> dma_alloc_from_contiguous()
>>>
>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>> when
>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>
>>> With CMA, this function becomes progressively more slow with every
>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>> hundreds or thousands of microseconds (before it gives up and
>>> alloc_pages_node() fallback is used), so this causes the
>>> multi-second/minute hangs of the desktop.
>>>
>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>> the
>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>> find a fitting hole big enough to satisfy allocations with a retry
>>> loop (see
>>>
>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>> that takes forever.
> I am curious why it does not end up using the pool. As in use the TTM DMA pool to pick pages instead of allocating (and freeing) new ones?
>
>>> This is not good, also not for other devices which actually need a
>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>> still need physically contiguous dma memory, maybe with exception of
>>> some embedded gpus?
> Oh. If I understood you correctly - the CMA ends up giving huge chunks of contiguous area. But if the sizes are 4kb I wonder why it would do that?
>
> The modern GPUs on x86 can deal with scatter gather and as you surmise don't need contiguous physical contiguous areas.
>>> My naive approach would be to add a new gfp_t flag a la
>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>> refrain from doing so if they have some fallback for getting memory.
>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>> around here:
>>>
>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>> However i'm not familiar enough with memory management, so likely
>>> greater minds here have much better ideas on how to deal with this?
>>>
> That is a bit of hack to deal with CMA being slow.
>
> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>> thanks,
>>> -mario
>>>
>>> Typical snippet from an example trace of a badly stalling desktop
>> with
>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>> ftrace_filter settings):
>>>
>>> 1)               |                          ttm_dma_pool_get_pages
>>> [ttm]() {
>>>  1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>  1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>>  1)               | dma_generic_alloc_coherent() {
>>>  1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>  1) ! 1874.292 us |                                  }
>>>  1) ! 1875.400 us |                                }
>>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>>  1)               | dma_generic_alloc_coherent() {
>>>  1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>  1) ! 1869.586 us |                                  }
>>>  1) ! 1870.053 us |                                }
>>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>>  1)               | dma_generic_alloc_coherent() {
>>>  1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>  1) ! 1872.240 us |                                  }
>>>  1) ! 1872.669 us |                                }
>>>  1)               | __ttm_dma_alloc_page [ttm]() {
>>>  1)               | dma_generic_alloc_coherent() {
>>>  1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>  1) ! 1890.179 us |                                  }
>>>  1) ! 1890.608 us |                                }
>>>  1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>  1) ! 7511.000 us |                              }
>>>  1) ! 7511.306 us |                            }
>>>  1) ! 7511.623 us |                          }
>>>
>>> The good case (with cma=0 kernel cmdline, so
>>> dma_alloc_from_contiguous() no-ops,)
>>>
>>> 0)               |                          ttm_dma_pool_get_pages
>>> [ttm]() {
>>>  0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>  0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>>  0)               | dma_generic_alloc_coherent() {
>>>  0)   0.171 us    | dma_alloc_from_contiguous();
>>>  0)   0.849 us    | __alloc_pages_nodemask();
>>>  0)   3.029 us    |                                  }
>>>  0)   3.882 us    |                                }
>>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>>  0)               | dma_generic_alloc_coherent() {
>>>  0)   0.037 us    | dma_alloc_from_contiguous();
>>>  0)   0.163 us    | __alloc_pages_nodemask();
>>>  0)   1.408 us    |                                  }
>>>  0)   1.719 us    |                                }
>>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>>  0)               | dma_generic_alloc_coherent() {
>>>  0)   0.035 us    | dma_alloc_from_contiguous();
>>>  0)   0.153 us    | __alloc_pages_nodemask();
>>>  0)   1.454 us    |                                  }
>>>  0)   1.720 us    |                                }
>>>  0)               | __ttm_dma_alloc_page [ttm]() {
>>>  0)               | dma_generic_alloc_coherent() {
>>>  0)   0.036 us    | dma_alloc_from_contiguous();
>>>  0)   0.112 us    | __alloc_pages_nodemask();
>>>  0)   1.211 us    |                                  }
>>>  0)   1.541 us    |                                }
>>>  0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>  0) + 10.902 us   |                              }
>>>  0) + 11.577 us   |                            }
>>>  0) + 11.988 us   |                          }
>>>
>>> _______________________________________________
>>> dri-devel mailing list
>>> dri-devel@lists.freedesktop.org
>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>


^ permalink raw reply	[flat|nested] 30+ messages in thread


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-09 13:58       ` Thomas Hellstrom
  (?)
@ 2014-08-10  3:06       ` Mario Kleiner
  -1 siblings, 0 replies; 30+ messages in thread
From: Mario Kleiner @ 2014-08-10  3:06 UTC (permalink / raw)
  To: Thomas Hellstrom, Konrad Rzeszutek Wilk
  Cc: kamal, ben, LKML, dri-devel, m.szyprowski


[-- Attachment #1.1: Type: text/plain, Size: 12332 bytes --]

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>
> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>> On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom <thellstrom@vmware.com> wrote:
>>> Hi.
>>>
>> Hey Thomas!
>>
>>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>> the
>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>> shed
>>> some light over this?
>> It should allocate in batches and keep them in the TTM DMA pool for some time to be reused.
>>
>> The pages that it gets are in 4kb granularity though.
> Then I feel inclined to say this is a DMA subsystem bug. Single page
> allocations shouldn't get routed to CMA.
>
> /Thomas

Yes, seems you're both right. I read through the code a bit more and 
indeed the TTM DMA pool allocates only one page during each 
dma_alloc_coherent() call, so it doesn't need CMA memory. The current 
allocators don't check for single-page allocations and therefore try to 
get them from the CMA area anyway, instead of skipping straight to the 
much cheaper fallback.

So the callers of dma_alloc_from_contiguous() could need that little 
optimization of skipping it if only one page is requested. For

dma_generic_alloc_coherent  <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>  andintel_alloc_coherent  <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>  this seems easy to do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks 
for such a case and the code looks to me as if that's at least somewhat 
intentional.
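Expressed as a small userspace model (a hedged sketch of the proposed 
check only; the flag bit value and the helper name are made up for 
illustration and are not the real kernel definitions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative stand-in for the kernel's __GFP_WAIT bit. */
#define FAKE_GFP_WAIT (1u << 4)

/*
 * Sketch of the decision the generic coherent allocators could make
 * before calling dma_alloc_from_contiguous(): single-page requests
 * gain nothing from CMA and should fall straight through to the
 * normal page allocator fallback.
 */
static bool should_try_cma(size_t count, unsigned int gfp_flags)
{
    if (count <= 1)                   /* single page: skip CMA entirely */
        return false;
    if (!(gfp_flags & FAKE_GFP_WAIT)) /* atomic context: CMA may block */
        return false;
    return true;
}
```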

As far as TTM goes, one quick one-line fix to prevent it from using 
CMA, at least on SWIOTLB, NOMMU and Intel IOMMU (when using the above 
methods), would be to clear the __GFP_WAIT flag from the passed gfp_t 
flags. That would trigger the well-working fallback. So, is __GFP_WAIT 
needed for those single page allocations that go through 
__ttm_dma_alloc_page()?
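The one-liner in question would amount to something like this (hedged 
sketch; the flag bit value and the helper name are made up for 
illustration, the real change would sit in ttm_dma_populate()'s gfp 
flag setup):

```c
#include <assert.h>

/* Illustrative stand-ins, not the real kernel bit values. */
#define FAKE_GFP_WAIT  (1u << 4)
#define FAKE_GFP_DMA32 (1u << 2)

/*
 * Proposed tweak: strip the blocking flag from the gfp mask TTM
 * passes down, so the DMA layer's CMA path is bypassed and the
 * cheap alloc_pages_node() style fallback is used instead.
 */
static unsigned int ttm_dma_gfp_flags_no_cma(unsigned int gfp_flags)
{
    return gfp_flags & ~FAKE_GFP_WAIT;
}
```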

It would be nice to have such a simple, non-intrusive one-line patch 
that we could still get into 3.17 and then backport to older stable 
kernels to avoid the same desktop hangs there if CMA is enabled. It 
would also be nice for actual users of CMA to not use up lots of CMA 
space for GPUs which don't need it. I think DMA_CMA was introduced 
around 3.12.


The other problem is that TTM probably does not reuse pages from the 
DMA pool. If I trace the __ttm_dma_alloc_page and __ttm_dma_free_page 
calls for those single page allocs/frees, then over a 20 second 
interval of tracing and switching tabs in firefox, scrolling things 
around etc. I find about as many allocs as frees, e.g., 1607 allocs 
vs. 1648 frees.

This bit of code from ttm_dma_unpopulate() (line 954 in 3.16) looks 
suspicious:

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

Allocations from a tt_cached cached pool (if (is_cached)...) always 
get freed and are not given back to the cached pool. In the uncached 
case, there's logic to make sure the pool doesn't grow forever (line 
955, checking against _manager->options.max_size), but before that 
check, in line 954, there's an unconditional assignment of npages = 
count; which seems to force freeing all pages as well, instead of 
recycling them. Is this some leftover debug code, or intentional and 
just me not understanding what happens there?
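To make the suspicion concrete, here is a small userspace model of 
that decision as I read the 3.16 code (this is my paraphrase, not the 
kernel source): with the unconditional npages = count, the pool frees 
as many pages as it just received even while it is below max_size, so 
nothing ever gets recycled.

```c
#include <assert.h>

/*
 * Model of the uncached branch of ttm_dma_unpopulate(): returns how
 * many pages get handed back to the page allocator. 'unconditional'
 * toggles the suspicious 'npages = count;' line at 954.
 */
static unsigned int pages_freed(unsigned int npages_free_before,
                                unsigned int count,
                                unsigned int max_size,
                                int unconditional)
{
    unsigned int npages_free = npages_free_before + count;
    unsigned int npages = 0;

    if (unconditional)
        npages = count;          /* line 954: always free what came back */
    if (npages_free > max_size)  /* line 955: trim only the excess */
        npages = npages_free - max_size;
    return npages;
}
```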

thanks,
-mario


>
>>> /Thomas
>>>
>>>
>>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>>> Hi all,
>>>>
>>>> there is a rather severe performance problem i accidentally found
>>> when
>>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>>
>>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>>> previous kernels would likely be affected too.
>>>>
>>>> After a few minutes of regular desktop use like switching workspaces,
>>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>>> composition), i get chunky desktop updates, then multi-second
>>> freezes,
>>>> after a few minutes the desktop hangs for over a minute on almost any
>>>> GUI action like switching windows etc. --> Unuseable.
>>>>
>>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>>> example ftrace snippets at the end of this mail):
>>>>
>>>> ...ttm dma coherent memory allocations, e.g., from
>>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>>> dma_alloc_from_contiguous()
>>>>
>>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>>> when
>>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>>
>>>> With CMA, this function becomes progressively more slow with every
>>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>>> hundreds or thousands of microseconds (before it gives up and
>>>> alloc_pages_node() fallback is used), so this causes the
>>>> multi-second/minute hangs of the desktop.
>>>>
>>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>>> the
>>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>>> find a fitting hole big enough to satisfy allocations with a retry
>>>> loop (see
>>>>
>>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>>> that takes forever.
>> I am curious why it does not end up using the pool. As in use the TTM DMA pool to pick pages instead of allocating (and freeing) new ones?
>>
>>>> This is not good, also not for other devices which actually need a
>>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>>> still need physically contiguous dma memory, maybe with exception of
>>>> some embedded gpus?
>> Oh. If I understood you correctly - the CMA ends up giving huge chunks of contiguous area. But if the sizes are 4kb I wonder why it would do that?
>>
>> The modern GPUs on x86 can deal with scatter gather and as you surmise don't need contiguous physical contiguous areas.
>>>> My naive approach would be to add a new gfp_t flag a la
>>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>>> refrain from doing so if they have some fallback for getting memory.
>>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>>> around here:
>>>>
>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>>> However i'm not familiar enough with memory management, so likely
>>>> greater minds here have much better ideas on how to deal with this?
>>>>
>> That is a bit of hack to deal with CMA being slow.
>>
>> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>>> thanks,
>>>> -mario
>>>>
>>>> Typical snippet from an example trace of a badly stalling desktop
>>> with
>>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>>> ftrace_filter settings):
>>>>
>>>> 1)               |                          ttm_dma_pool_get_pages
>>>> [ttm]() {
>>>>   1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>   1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>>   1) ! 1874.292 us |                                  }
>>>>   1) ! 1875.400 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>>   1) ! 1869.586 us |                                  }
>>>>   1) ! 1870.053 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>>   1) ! 1872.240 us |                                  }
>>>>   1) ! 1872.669 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>>   1) ! 1890.179 us |                                  }
>>>>   1) ! 1890.608 us |                                }
>>>>   1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>>   1) ! 7511.000 us |                              }
>>>>   1) ! 7511.306 us |                            }
>>>>   1) ! 7511.623 us |                          }
>>>>
>>>> The good case (with cma=0 kernel cmdline, so
>>>> dma_alloc_from_contiguous() no-ops,)
>>>>
>>>> 0)               |                          ttm_dma_pool_get_pages
>>>> [ttm]() {
>>>>   0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>   0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.171 us    | dma_alloc_from_contiguous();
>>>>   0)   0.849 us    | __alloc_pages_nodemask();
>>>>   0)   3.029 us    |                                  }
>>>>   0)   3.882 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.037 us    | dma_alloc_from_contiguous();
>>>>   0)   0.163 us    | __alloc_pages_nodemask();
>>>>   0)   1.408 us    |                                  }
>>>>   0)   1.719 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.035 us    | dma_alloc_from_contiguous();
>>>>   0)   0.153 us    | __alloc_pages_nodemask();
>>>>   0)   1.454 us    |                                  }
>>>>   0)   1.720 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.036 us    | dma_alloc_from_contiguous();
>>>>   0)   0.112 us    | __alloc_pages_nodemask();
>>>>   0)   1.211 us    |                                  }
>>>>   0)   1.541 us    |                                }
>>>>   0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>   0) + 10.902 us   |                              }
>>>>   0) + 11.577 us   |                            }
>>>>   0) + 11.988 us   |                          }
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


[-- Attachment #1.2: Type: text/html, Size: 16441 bytes --]

[-- Attachment #2: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-09 13:58       ` Thomas Hellstrom
@ 2014-08-10  3:11         ` Mario Kleiner
  -1 siblings, 0 replies; 30+ messages in thread
From: Mario Kleiner @ 2014-08-10  3:11 UTC (permalink / raw)
  To: Thomas Hellstrom, Konrad Rzeszutek Wilk
  Cc: dri-devel, kamal, LKML, ben, m.szyprowski

Resent this time without HTML formatting which lkml doesn't like. Sorry.

On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>> On August 9, 2014 1:39:39 AM EDT, Thomas Hellstrom<thellstrom@vmware.com>  wrote:
>>> Hi.
>>>
>> Hey Thomas!
>>
>>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>> the
>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>> shed
>>> some light over this?
>> It should allocate in batches and keep them in the TTM DMA pool for some time to be reused.
>>
>> The pages that it gets are in 4kb granularity though.
> Then I feel inclined to say this is a DMA subsystem bug. Single page
> allocations shouldn't get routed to CMA.
>
> /Thomas

Yes, seems you're both right. I read through the code a bit more and 
indeed the TTM DMA pool allocates only one page during each 
dma_alloc_coherent() call, so it doesn't need CMA memory. The current 
allocators don't check for single page CMA allocations and therefore try 
to get it from the CMA area anyway, instead of skipping to the much 
cheaper fallback.

So the callers of dma_alloc_from_contiguous() could use that little 
optimization of skipping it if only one page is requested. For 
dma_generic_alloc_coherent and intel_alloc_coherent this seems easy to 
do. Looking at the arm arch variants, e.g.,

http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194

and

http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44

i'm not sure if it is that easily done, as there aren't any fallbacks 
for such a case and the code looks to me as if that's at least somewhat 
intentional.

As far as TTM goes, one quick one-line fix to prevent it from using 
CMA, at least on SWIOTLB, NOMMU and Intel IOMMU (when using the above 
methods), would be to clear the __GFP_WAIT flag from the passed gfp_t 
flags. That would trigger the well-working fallback. So, is __GFP_WAIT 
needed for those single page allocations that go through 
__ttm_dma_alloc_page()?

It would be nice to have such a simple, non-intrusive one-line patch 
that we could still get into 3.17 and then backport to older stable 
kernels to avoid the same desktop hangs there if CMA is enabled. It 
would also be nice for actual users of CMA to not use up lots of CMA 
space for GPUs which don't need it. I think DMA_CMA was introduced 
around 3.12.


The other problem is that TTM probably does not reuse pages from the 
DMA pool. If I trace the __ttm_dma_alloc_page and __ttm_dma_free_page 
calls for those single page allocs/frees, then over a 20 second 
interval of tracing and switching tabs in firefox, scrolling things 
around etc. I find about as many allocs as frees, e.g., 1607 allocs 
vs. 1648 frees.

This bit of code from ttm_dma_unpopulate() (line 954 in 3.16) looks 
suspicious:

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

Allocations from a tt_cached cached pool (if (is_cached)...) always 
get freed and are not given back to the cached pool. In the uncached 
case, there's logic to make sure the pool doesn't grow forever (line 
955, checking against _manager->options.max_size), but before that 
check, in line 954, there's an unconditional assignment of npages = 
count; which seems to force freeing all pages as well, instead of 
recycling them. Is this some leftover debug code, or intentional and 
just me not understanding what happens there?

thanks,
-mario


>>> /Thomas
>>>
>>>
>>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>>> Hi all,
>>>>
>>>> there is a rather severe performance problem i accidentally found
>>> when
>>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>>
>>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>>> previous kernels would likely be affected too.
>>>>
>>>> After a few minutes of regular desktop use like switching workspaces,
>>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>>> composition), i get chunky desktop updates, then multi-second
>>> freezes,
>>>> after a few minutes the desktop hangs for over a minute on almost any
>>>> GUI action like switching windows etc. --> Unuseable.
>>>>
>>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>>> example ftrace snippets at the end of this mail):
>>>>
>>>> ...ttm dma coherent memory allocations, e.g., from
>>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>>> dma_alloc_from_contiguous()
>>>>
>>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>>> when
>>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>>
>>>> With CMA, this function becomes progressively more slow with every
>>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>>> hundreds or thousands of microseconds (before it gives up and
>>>> alloc_pages_node() fallback is used), so this causes the
>>>> multi-second/minute hangs of the desktop.
>>>>
>>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>>> the
>>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>>> find a fitting hole big enough to satisfy allocations with a retry
>>>> loop (see
>>>>
>>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>>> that takes forever.
>> I am curious why it does not end up using the pool. As in use the TTM DMA pool to pick pages instead of allocating (and freeing) new ones?
>>
>>>> This is not good, also not for other devices which actually need a
>>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>>> still need physically contiguous dma memory, maybe with exception of
>>>> some embedded gpus?
>> Oh. If I understood you correctly - the CMA ends up giving huge chunks of contiguous area. But if the sizes are 4kb I wonder why it would do that?
>>
>> The modern GPUs on x86 can deal with scatter gather and as you surmise don't need contiguous physical contiguous areas.
>>>> My naive approach would be to add a new gfp_t flag a la
>>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>>> refrain from doing so if they have some fallback for getting memory.
>>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>>> around here:
>>>>
>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>>> However i'm not familiar enough with memory management, so likely
>>>> greater minds here have much better ideas on how to deal with this?
>>>>
>> That is a bit of hack to deal with CMA being slow.
>>
>> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>>> thanks,
>>>> -mario
>>>>
>>>> Typical snippet from an example trace of a badly stalling desktop
>>> with
>>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>>> ftrace_filter settings):
>>>>
>>>> 1)               |                          ttm_dma_pool_get_pages
>>>> [ttm]() {
>>>>   1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>   1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>>   1) ! 1874.292 us |                                  }
>>>>   1) ! 1875.400 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>>   1) ! 1869.586 us |                                  }
>>>>   1) ! 1870.053 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>>   1) ! 1872.240 us |                                  }
>>>>   1) ! 1872.669 us |                                }
>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>   1)               | dma_generic_alloc_coherent() {
>>>>   1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>>   1) ! 1890.179 us |                                  }
>>>>   1) ! 1890.608 us |                                }
>>>>   1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>>   1) ! 7511.000 us |                              }
>>>>   1) ! 7511.306 us |                            }
>>>>   1) ! 7511.623 us |                          }
>>>>
>>>> The good case (with cma=0 kernel cmdline, so
>>>> dma_alloc_from_contiguous() no-ops,)
>>>>
>>>> 0)               |                          ttm_dma_pool_get_pages
>>>> [ttm]() {
>>>>   0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>   0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.171 us    | dma_alloc_from_contiguous();
>>>>   0)   0.849 us    | __alloc_pages_nodemask();
>>>>   0)   3.029 us    |                                  }
>>>>   0)   3.882 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.037 us    | dma_alloc_from_contiguous();
>>>>   0)   0.163 us    | __alloc_pages_nodemask();
>>>>   0)   1.408 us    |                                  }
>>>>   0)   1.719 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.035 us    | dma_alloc_from_contiguous();
>>>>   0)   0.153 us    | __alloc_pages_nodemask();
>>>>   0)   1.454 us    |                                  }
>>>>   0)   1.720 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.036 us    | dma_alloc_from_contiguous();
>>>>   0)   0.112 us    | __alloc_pages_nodemask();
>>>>   0)   1.211 us    |                                  }
>>>>   0)   1.541 us    |                                }
>>>>   0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>   0) + 10.902 us   |                              }
>>>>   0) + 11.577 us   |                            }
>>>>   0) + 11.988 us   |                          }
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel


^ permalink raw reply	[flat|nested] 30+ messages in thread

>>>>   1) ! 7511.623 us |                          }
>>>>
>>>> The good case (with cma=0 kernel cmdline, so
>>>> dma_alloc_from_contiguous() no-ops,)
>>>>
>>>> 0)               |                          ttm_dma_pool_get_pages
>>>> [ttm]() {
>>>>   0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>   0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.171 us    | dma_alloc_from_contiguous();
>>>>   0)   0.849 us    | __alloc_pages_nodemask();
>>>>   0)   3.029 us    |                                  }
>>>>   0)   3.882 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.037 us    | dma_alloc_from_contiguous();
>>>>   0)   0.163 us    | __alloc_pages_nodemask();
>>>>   0)   1.408 us    |                                  }
>>>>   0)   1.719 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.035 us    | dma_alloc_from_contiguous();
>>>>   0)   0.153 us    | __alloc_pages_nodemask();
>>>>   0)   1.454 us    |                                  }
>>>>   0)   1.720 us    |                                }
>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>   0)               | dma_generic_alloc_coherent() {
>>>>   0)   0.036 us    | dma_alloc_from_contiguous();
>>>>   0)   0.112 us    | __alloc_pages_nodemask();
>>>>   0)   1.211 us    |                                  }
>>>>   0)   1.541 us    |                                }
>>>>   0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>   0) + 10.902 us   |                              }
>>>>   0) + 11.577 us   |                            }
>>>>   0) + 11.988 us   |                          }
>>>>
>>>> _______________________________________________
>>>> dri-devel mailing list
>>>> dri-devel@lists.freedesktop.org
>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-10  3:11         ` Mario Kleiner
@ 2014-08-10 11:03           ` Thomas Hellstrom
  -1 siblings, 0 replies; 30+ messages in thread
From: Thomas Hellstrom @ 2014-08-10 11:03 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Thomas Hellstrom, Konrad Rzeszutek Wilk, kamal, ben, LKML,
	dri-devel, m.szyprowski

On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> Resent this time without HTML formatting which lkml doesn't like. Sorry.
>
> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>> Hellstrom<thellstrom@vmware.com>  wrote:
>>>> Hi.
>>>>
>>> Hey Thomas!
>>>
>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>> the
>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>> shed
>>>> some light over this?
>>> It should allocate in batches and keep them in the TTM DMA pool for
>>> some time to be reused.
>>>
>>> The pages that it gets are in 4kb granularity though.
>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>> allocations shouldn't get routed to CMA.
>>
>> /Thomas
>
> Yes, seems you're both right. I read through the code a bit more and
> indeed the TTM DMA pool allocates only one page during each
> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> allocators don't check for single page CMA allocations and therefore
> try to get it from the CMA area anyway, instead of skipping to the
> much cheaper fallback.
>
> So the callers of dma_alloc_from_contiguous() could need that little
> optimization of skipping it if only one page is requested. For
>
> dma_generic_alloc_coherent 
> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent> 
> and intel_alloc_coherent 
> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>  this
> seems easy to do. Looking at the arm arch variants, e.g.,
>
> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>
> and
>
> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>
> i'm not sure if it is that easily done, as there aren't any fallbacks
> for such a case and the code looks to me as if that's at least
> somewhat intentional.
>
> As far as TTM goes, one quick one-line fix to prevent it from using
> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> above methods) would be to clear the __GFP_WAIT
> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT> flag from the
> passed gfp_t flags. That would trigger the well-working fallback. So, is
>
> __GFP_WAIT  <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>  needed
> for those single page allocations that go through __ttm_dma_alloc_page 
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
>
> It would be nice to have such a simple, non-intrusive one-line patch
> that we still could get into 3.17 and then backported to older stable
> kernels to avoid the same desktop hangs there if CMA is enabled. It
> would be also nice for actual users of CMA to not use up lots of CMA
> space for gpu's which don't need it. I think DMA_CMA was introduced
> around 3.12.
>

I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.

>
> The other problem is that probably TTM does not reuse pages from the
> DMA pool. If i trace the __ttm_dma_alloc_page
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page> and
> __ttm_dma_free_page
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page> calls for
> those single page allocs/frees, then over a 20 second interval of
> tracing and switching tabs in firefox, scrolling things around etc. i
> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> 1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas

>
> This bit of code from ttm_dma_unpopulate
> <http://lxr.free-electrons.com/ident?i=ttm_dma_unpopulate>()  (line
> 954 in 3.16) looks suspicious:
>
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>
>
> Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
> freed and are not given back to the cached pool. But in the uncached
> case, there's logic to make sure the pool doesn't grow forever (line
> 955, checking against _manager->options.max_size), but before that
> check in line 954 there's an unconditional assignment of npages =
> count; which seems to force freeing all pages as well, instead of
> recycling? Is this some debug code left over, or intentional and just
> me not understanding what happens there?
>
> thanks,
> -mario
>
>
>>>> /Thomas
>>>>
>>>>
>>>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>>>> Hi all,
>>>>>
>>>>> there is a rather severe performance problem i accidentally found
>>>> when
>>>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>>>
>>>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>>>> previous kernels would likely be affected too.
>>>>>
>>>>> After a few minutes of regular desktop use like switching workspaces,
>>>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>>>> composition), i get chunky desktop updates, then multi-second
>>>> freezes,
>>>>> after a few minutes the desktop hangs for over a minute on almost any
>>>>> GUI action like switching windows etc. --> Unuseable.
>>>>>
>>>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>>>> example ftrace snippets at the end of this mail):
>>>>>
>>>>> ...ttm dma coherent memory allocations, e.g., from
>>>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>>>> dma_alloc_from_contiguous()
>>>>>
>>>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>>>> when
>>>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>>>
>>>>> With CMA, this function becomes progressively more slow with every
>>>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>>>> hundreds or thousands of microseconds (before it gives up and
>>>>> alloc_pages_node() fallback is used), so this causes the
>>>>> multi-second/minute hangs of the desktop.
>>>>>
>>>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>>>> the
>>>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>>>> find a fitting hole big enough to satisfy allocations with a retry
>>>>> loop (see
>>>>>
>>>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>>>
>>>>> that takes forever.
>>> I am curious why it does not end up using the pool. As in use the
>>> TTM DMA pool to pick pages instead of allocating (and freeing) new
>>> ones?
>>>
>>>>> This is not good, also not for other devices which actually need a
>>>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>>>> still need physically contiguous dma memory, maybe with exception of
>>>>> some embedded gpus?
>>> Oh. If I understood you correctly - the CMA ends up giving huge
>>> chunks of contiguous area. But if the sizes are 4kb I wonder why it
>>> would do that?
>>>
>>> The modern GPUs on x86 can deal with scatter gather and as you
>>> surmise don't need contiguous physical contiguous areas.
>>>>> My naive approach would be to add a new gfp_t flag a la
>>>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>>>> refrain from doing so if they have some fallback for getting memory.
>>>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>>>> around here:
>>>>>
>>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>>>
>>>>> However i'm not familiar enough with memory management, so likely
>>>>> greater minds here have much better ideas on how to deal with this?
>>>>>
>>> That is a bit of hack to deal with CMA being slow.
>>>
>>> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>>>> thanks,
>>>>> -mario
>>>>>
>>>>> Typical snippet from an example trace of a badly stalling desktop
>>>> with
>>>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>>>> ftrace_filter settings):
>>>>>
>>>>> 1)               |                          ttm_dma_pool_get_pages
>>>>> [ttm]() {
>>>>>   1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>   1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1874.292 us |                                  }
>>>>>   1) ! 1875.400 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1869.586 us |                                  }
>>>>>   1) ! 1870.053 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1872.240 us |                                  }
>>>>>   1) ! 1872.669 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1890.179 us |                                  }
>>>>>   1) ! 1890.608 us |                                }
>>>>>   1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>>>   1) ! 7511.000 us |                              }
>>>>>   1) ! 7511.306 us |                            }
>>>>>   1) ! 7511.623 us |                          }
>>>>>
>>>>> The good case (with cma=0 kernel cmdline, so
>>>>> dma_alloc_from_contiguous() no-ops,)
>>>>>
>>>>> 0)               |                          ttm_dma_pool_get_pages
>>>>> [ttm]() {
>>>>>   0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>   0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.171 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.849 us    | __alloc_pages_nodemask();
>>>>>   0)   3.029 us    |                                  }
>>>>>   0)   3.882 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.037 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.163 us    | __alloc_pages_nodemask();
>>>>>   0)   1.408 us    |                                  }
>>>>>   0)   1.719 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.035 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.153 us    | __alloc_pages_nodemask();
>>>>>   0)   1.454 us    |                                  }
>>>>>   0)   1.720 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.036 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.112 us    | __alloc_pages_nodemask();
>>>>>   0)   1.211 us    |                                  }
>>>>>   0)   1.541 us    |                                }
>>>>>   0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>>   0) + 10.902 us   |                              }
>>>>>   0) + 11.577 us   |                            }
>>>>>   0) + 11.988 us   |                          }
>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
@ 2014-08-10 11:03           ` Thomas Hellstrom
  0 siblings, 0 replies; 30+ messages in thread
From: Thomas Hellstrom @ 2014-08-10 11:03 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Thomas Hellstrom, Konrad Rzeszutek Wilk, kamal, LKML, dri-devel,
	ben, m.szyprowski

On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> Resent this time without HTML formatting which lkml doesn't like. Sorry.
>
> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>> Hellstrom<thellstrom@vmware.com>  wrote:
>>>> Hi.
>>>>
>>> Hey Thomas!
>>>
>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>> the
>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>> shed
>>>> some light over this?
>>> It should allocate in batches and keep them in the TTM DMA pool for
>>> some time to be reused.
>>>
>>> The pages that it gets are in 4kb granularity though.
>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>> allocations shouldn't get routed to CMA.
>>
>> /Thomas
>
> Yes, seems you're both right. I read through the code a bit more and
> indeed the TTM DMA pool allocates only one page during each
> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> allocators don't check for single page CMA allocations and therefore
> try to get it from the CMA area anyway, instead of skipping to the
> much cheaper fallback.
>
> So the callers of dma_alloc_from_contiguous() could need that little
> optimization of skipping it if only one page is requested. For
>
> dma_generic_alloc_coherent 
> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent> 
> andintel_alloc_coherent 
> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>  this
> seems easy to do. Looking at the arm arch variants, e.g.,
>
> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>
> and
>
> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>
> i'm not sure if it is that easily done, as there aren't any fallbacks
> for such a case and the code looks to me as if that's at least
> somewhat intentional.
>
> As far as TTM goes, one quick one-line fix to prevent it from using
> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> above methods) would be to clear the __GFP_WAIT
> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT> flag from the
> passed gfp_t flags. That would trigger the well working fallback. So, is
>
> __GFP_WAIT  <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>  needed
> for those single page allocations that go through__ttm_dma_alloc_page 
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
>
> It would be nice to have such a simple, non-intrusive one-line patch
> that we still could get into 3.17 and then backported to older stable
> kernels to avoid the same desktop hangs there if CMA is enabled. It
> would be also nice for actual users of CMA to not use up lots of CMA
> space for gpu's which don't need it. I think DMA_CMA was introduced
> around 3.12.
>

I don't think that's a good idea. Omitting __GFP_WAIT would cause
unnecessary memory allocation errors on systems under stress.
I think this should be filed as a DMA subsystem kernel bug / regression
and an appropriate solution should be worked out together with the DMA
subsystem maintainers and then backported.

>
> The other problem is that probably TTM does not reuse pages from the
> DMA pool. If i trace the __ttm_dma_alloc_page
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page> and
> __ttm_dma_free_page
> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page> calls for
> those single page allocs/frees, then over a 20 second interval of
> tracing and switching tabs in firefox, scrolling things around etc. i
> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> 1648 frees.

This is because historically the pools have been designed to keep only
pages with nonstandard caching attributes since changing page caching
attributes have been very slow but the kernel page allocators have been
reasonably fast.

/Thomas

>
> This bit of code fromttm_dma_unpopulate
> <http://lxr.free-electrons.com/ident?i=ttm_dma_unpopulate>()  (line
> 954 in 3.16) looks suspicious:
>
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>
>
> Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
> freed and are not given back to the cached pool. But in the uncached
> case, there's logic to make sure the pool doesn't grow forever (line
> 955, checking against _manager->options.max_size), but before that
> check in line 954 there's an uncoditional assignment of npages =
> count; which seems to force freeing all pages as well, instead of
> recycling? Is this some debug code left over, or intentional and just
> me not understanding what happens there?
>
> thanks,
> -mario
>
>
>>>> /Thomas
>>>>
>>>>
>>>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>>>> Hi all,
>>>>>
>>>>> there is a rather severe performance problem i accidentally found
>>>> when
>>>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>>>
>>>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>>>> previous kernels would likely be affected too.
>>>>>
>>>>> After a few minutes of regular desktop use like switching workspaces,
>>>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>>>> composition), i get chunky desktop updates, then multi-second
>>>> freezes,
>>>>> after a few minutes the desktop hangs for over a minute on almost any
>>>>> GUI action like switching windows etc. --> Unuseable.
>>>>>
>>>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>>>> example ftrace snippets at the end of this mail):
>>>>>
>>>>> ...ttm dma coherent memory allocations, e.g., from
>>>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>>>> dma_alloc_from_contiguous()
>>>>>
>>>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>>>> when
>>>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>>>
>>>>> With CMA, this function becomes progressively more slow with every
>>>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>>>> hundreds or thousands of microseconds (before it gives up and
>>>>> alloc_pages_node() fallback is used), so this causes the
>>>>> multi-second/minute hangs of the desktop.
>>>>>
>>>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>>>> the
>>>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>>>> find a fitting hole big enough to satisfy allocations with a retry
>>>>> loop (see
>>>>>
>>>> https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c%23L339&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=6cy0madhpBCtEyOKu95ucqhzU%2FjAHPP7ODVTc47UYQs%3D%0A&s=42356aad2ff181236f4704283dc058fdd7b7e213cdea7378665094b35ee0dfdf)
>>>>
>>>>> that takes forever.
>>> I am curious why it does not end up using the pool. As in use the
>>> TTM DMA pool to pick pages instead of allocating (and freeing) new
>>> ones?
>>>
>>>>> This is not good, also not for other devices which actually need a
>>>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>>>> still need physically contiguous dma memory, maybe with exception of
>>>>> some embedded gpus?
>>> Oh. If I understood you correctly - the CMA ends up giving huge
>>> chunks of contiguous area. But if the sizes are 4kb I wonder why it
>>> would do that?
>>>
>>> The modern GPUs on x86 can deal with scatter gather and as you
>>> surmise don't need contiguous physical contiguous areas.
>>>>> My naive approach would be to add a new gfp_t flag a la
>>>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>>>> refrain from doing so if they have some fallback for getting memory.
>>>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>>>> around here:
>>>>>
>>>> https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c%23L884&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=6cy0madhpBCtEyOKu95ucqhzU%2FjAHPP7ODVTc47UYQs%3D%0A&s=0c2a37c8bac57e0ab7333a9580eb5114e09566d1d34ab43be7a80de8316bdcdd
>>>>
>>>>> However i'm not familiar enough with memory management, so likely
>>>>> greater minds here have much better ideas on how to deal with this?
>>>>>
>>> That is a bit of hack to deal with CMA being slow.
>>>
>>> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>>>> thanks,
>>>>> -mario
>>>>>
>>>>> Typical snippet from an example trace of a badly stalling desktop
>>>> with
>>>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>>>> ftrace_filter settings):
>>>>>
>>>>> 1)               |                          ttm_dma_pool_get_pages
>>>>> [ttm]() {
>>>>>   1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>   1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1874.292 us |                                  }
>>>>>   1) ! 1875.400 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1869.586 us |                                  }
>>>>>   1) ! 1870.053 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1872.240 us |                                  }
>>>>>   1) ! 1872.669 us |                                }
>>>>>   1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   1)               | dma_generic_alloc_coherent() {
>>>>>   1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>>>   1) ! 1890.179 us |                                  }
>>>>>   1) ! 1890.608 us |                                }
>>>>>   1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>>>   1) ! 7511.000 us |                              }
>>>>>   1) ! 7511.306 us |                            }
>>>>>   1) ! 7511.623 us |                          }
>>>>>
>>>>> The good case (with cma=0 kernel cmdline, so
>>>>> dma_alloc_from_contiguous() no-ops,)
>>>>>
>>>>> 0)               |                          ttm_dma_pool_get_pages
>>>>> [ttm]() {
>>>>>   0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>   0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.171 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.849 us    | __alloc_pages_nodemask();
>>>>>   0)   3.029 us    |                                  }
>>>>>   0)   3.882 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.037 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.163 us    | __alloc_pages_nodemask();
>>>>>   0)   1.408 us    |                                  }
>>>>>   0)   1.719 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.035 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.153 us    | __alloc_pages_nodemask();
>>>>>   0)   1.454 us    |                                  }
>>>>>   0)   1.720 us    |                                }
>>>>>   0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>   0)               | dma_generic_alloc_coherent() {
>>>>>   0)   0.036 us    | dma_alloc_from_contiguous();
>>>>>   0)   0.112 us    | __alloc_pages_nodemask();
>>>>>   0)   1.211 us    |                                  }
>>>>>   0)   1.541 us    |                                }
>>>>>   0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>>   0) + 10.902 us   |                              }
>>>>>   0) + 11.577 us   |                            }
>>>>>   0) + 11.988 us   |                          }
>>>>>
>>>>> _______________________________________________
>>>>> dri-devel mailing list
>>>>> dri-devel@lists.freedesktop.org
>>>>> https://urldefense.proofpoint.com/v1/url?u=http://lists.freedesktop.org/mailman/listinfo/dri-devel&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=6cy0madhpBCtEyOKu95ucqhzU%2FjAHPP7ODVTc47UYQs%3D%0A&s=d2636419e1f7f56c0d270e29ffe6ab6c6e29249876a578d70d973058f9411831
>>>>>

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-10 11:03           ` Thomas Hellstrom
@ 2014-08-10 18:02             ` Mario Kleiner
  -1 siblings, 0 replies; 30+ messages in thread
From: Mario Kleiner @ 2014-08-10 18:02 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: Konrad Rzeszutek Wilk, kamal, ben, LKML, dri-devel, m.szyprowski

On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>> Resent this time without HTML formatting which lkml doesn't like. Sorry.
>>
>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>>> Hellstrom<thellstrom@vmware.com>  wrote:
>>>>> Hi.
>>>>>
>>>> Hey Thomas!
>>>>
>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more than
>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>>> the
>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>>> shed
>>>>> some light over this?
>>>> It should allocate in batches and keep them in the TTM DMA pool for
>>>> some time to be reused.
>>>>
>>>> The pages that it gets are in 4kb granularity though.
>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>>> allocations shouldn't get routed to CMA.
>>>
>>> /Thomas
>> Yes, seems you're both right. I read through the code a bit more and
>> indeed the TTM DMA pool allocates only one page during each
>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
>> allocators don't check for single page CMA allocations and therefore
>> try to get it from the CMA area anyway, instead of skipping to the
>> much cheaper fallback.
>>
>> So the callers of dma_alloc_from_contiguous() could need that little
>> optimization of skipping it if only one page is requested. For
>>
>> dma_generic_alloc_coherent
>> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
>> and intel_alloc_coherent
>> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>  this
>> seems easy to do. Looking at the arm arch variants, e.g.,
>>
>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>>
>> and
>>
>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>>
>> i'm not sure if it is that easily done, as there aren't any fallbacks
>> for such a case and the code looks to me as if that's at least
>> somewhat intentional.
>>
>> As far as TTM goes, one quick one-line fix to prevent it from using
>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
>> above methods) would be to clear the __GFP_WAIT
>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT> flag from the
>> passed gfp_t flags. That would trigger the well-working fallback. So, is
>>
>> __GFP_WAIT  <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>  needed
>> for those single page allocations that go through __ttm_dma_alloc_page
>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
>>
>> It would be nice to have such a simple, non-intrusive one-line patch
>> that we still could get into 3.17 and then backported to older stable
>> kernels to avoid the same desktop hangs there if CMA is enabled. It
>> would be also nice for actual users of CMA to not use up lots of CMA
>> space for gpu's which don't need it. I think DMA_CMA was introduced
>> around 3.12.
>>
> I don't think that's a good idea. Omitting __GFP_WAIT would cause
> unnecessary memory allocation errors on systems under stress.
> I think this should be filed as a DMA subsystem kernel bug / regression
> and an appropriate solution should be worked out together with the DMA
> subsystem maintainers and then backported.

Ok, so it is needed. I'll file a bug report.

>> The other problem is that probably TTM does not reuse pages from the
>> DMA pool. If i trace the __ttm_dma_alloc_page
>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page> and
>> __ttm_dma_free_page
>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page> calls for
>> those single page allocs/frees, then over a 20 second interval of
>> tracing and switching tabs in firefox, scrolling things around etc. i
>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>> 1648 frees.
> This is because historically the pools have been designed to keep only
> pages with nonstandard caching attributes since changing page caching
> attributes have been very slow but the kernel page allocators have been
> reasonably fast.
>
> /Thomas

Ok. A bit more ftraceing showed my hang problem case goes through the 
"if (is_cached)" paths, so the pool doesn't recycle anything and i see 
it bouncing up and down by 4 pages all the time.

But for the non-cached case, which i don't hit with my problem, could 
one of you look at line 954...

http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954

... and tell me why that unconditional "npages = count;" assignment makes sense? It seems to essentially disable all recycling for the DMA pool whenever the pool isn't filled up to or beyond its maximum with free pages. When the pool is filled up, lots of pages are recycled, but when it is already somewhat below capacity, it gets "punished" by not being refilled? I'd just like to understand the logic behind that line.

thanks,
-mario


>> This bit of code from ttm_dma_unpopulate
>> <http://lxr.free-electrons.com/ident?i=ttm_dma_unpopulate>()  (line
>> 954 in 3.16) looks suspicious:
>>
>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>>
>>
>> Alloc's from a tt_cached cached pool ( if (is_cached)...) always get
>> freed and are not given back to the cached pool. But in the uncached
>> case, there's logic to make sure the pool doesn't grow forever (line
>> 955, checking against _manager->options.max_size), but before that
>> check in line 954 there's an unconditional assignment of npages =
>> count; which seems to force freeing all pages as well, instead of
>> recycling? Is this some debug code left over, or intentional and just
>> me not understanding what happens there?
>>
>> thanks,
>> -mario
>>
>>
>>>>> /Thomas
>>>>>
>>>>>
>>>>> On 08/08/2014 07:42 PM, Mario Kleiner wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> there is a rather severe performance problem i accidentally found
>>>>> when
>>>>>> trying to give Linux 3.16.0 a final test on a x86_64 MacBookPro under
>>>>>> Ubuntu 14.04 LTS with nouveau as graphics driver.
>>>>>>
>>>>>> I was lazy and just installed the Ubuntu precompiled mainline kernel.
>>>>>> That kernel happens to have CONFIG_DMA_CMA=y set, with a default CMA
>>>>>> (contiguous memory allocator) size of 64 MB. Older Ubuntu kernels
>>>>>> weren't compiled with CMA, so i only observed this on 3.16, but
>>>>>> previous kernels would likely be affected too.
>>>>>>
>>>>>> After a few minutes of regular desktop use like switching workspaces,
>>>>>> scrolling text in a terminal window, Firefox with multiple tabs open,
>>>>>> Thunderbird etc. (tested with KDE/Kwin, with/without desktop
>>>>>> composition), i get chunky desktop updates, then multi-second
>>>>> freezes,
>>>>>> after a few minutes the desktop hangs for over a minute on almost any
>>>>>> GUI action like switching windows etc. --> Unuseable.
>>>>>>
>>>>>> ftrace'ing shows the culprit being this callchain (typical good/bad
>>>>>> example ftrace snippets at the end of this mail):
>>>>>>
>>>>>> ...ttm dma coherent memory allocations, e.g., from
>>>>>> __ttm_dma_alloc_page() ... --> dma_alloc_coherent() --> platform
>>>>>> specific hooks ... -> dma_generic_alloc_coherent() [on x86_64] -->
>>>>>> dma_alloc_from_contiguous()
>>>>>>
>>>>>> dma_alloc_from_contiguous() is a no-op without CONFIG_DMA_CMA, or
>>>>> when
>>>>>> the machine is booted with kernel boot cmdline parameter "cma=0", so
>>>>>> it triggers the fast alloc_pages_node() fallback at least on x86_64.
>>>>>>
>>>>>> With CMA, this function becomes progressively more slow with every
>>>>>> minute of desktop use, e.g., runtimes going up from < 0.3 usecs to
>>>>>> hundreds or thousands of microseconds (before it gives up and
>>>>>> alloc_pages_node() fallback is used), so this causes the
>>>>>> multi-second/minute hangs of the desktop.
>>>>>>
>>>>>> So it seems ttm memory allocations quickly fragment and/or exhaust
>>>>> the
>>>>>> CMA memory area, and dma_alloc_from_contiguous() tries very hard to
>>>>>> find a fitting hole big enough to satisfy allocations with a retry
>>>>>> loop (see
>>>>>>
>>>>>> http://lxr.free-electrons.com/source/drivers/base/dma-contiguous.c#L339)
>>>>>> that takes forever.
>>>> I am curious why it does not end up using the pool. As in use the
>>>> TTM DMA pool to pick pages instead of allocating (and freeing) new
>>>> ones?
>>>>
>>>>>> This is not good, also not for other devices which actually need a
>>>>>> non-fragmented CMA for DMA, so what to do? I doubt most current gpus
>>>>>> still need physically contiguous dma memory, maybe with exception of
>>>>>> some embedded gpus?
>>>> Oh. If I understood you correctly - the CMA ends up giving huge
>>>> chunks of contiguous area. But if the sizes are 4kb I wonder why it
>>>> would do that?
>>>>
>>>> The modern GPUs on x86 can deal with scatter gather and as you
>>>> surmise don't need contiguous physical contiguous areas.
>>>>>> My naive approach would be to add a new gfp_t flag a la
>>>>>> ___GFP_AVOIDCMA, and make callers of dma_alloc_from_contiguous()
>>>>>> refrain from doing so if they have some fallback for getting memory.
>>>>>> And then add that flag to ttm's ttm_dma_populate() gfp_flags, e.g.,
>>>>>> around here:
>>>>>>
>>>>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L884
>>>>>> However i'm not familiar enough with memory management, so likely
>>>>>> greater minds here have much better ideas on how to deal with this?
>>>>>>
>>>> That is a bit of hack to deal with CMA being slow.
>>>>
>>>> Hmm. Let's first figure out why TTM DMA pool is not reusing pages.
>>>>>> thanks,
>>>>>> -mario
>>>>>>
>>>>>> Typical snippet from an example trace of a badly stalling desktop
>>>>> with
>>>>>> CMA (alloc_pages_node() fallback may have been missing in this traces
>>>>>> ftrace_filter settings):
>>>>>>
>>>>>> 1)               |                          ttm_dma_pool_get_pages
>>>>>> [ttm]() {
>>>>>>    1)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>>    1)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>>    1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    1)               | dma_generic_alloc_coherent() {
>>>>>>    1) ! 1873.071 us | dma_alloc_from_contiguous();
>>>>>>    1) ! 1874.292 us |                                  }
>>>>>>    1) ! 1875.400 us |                                }
>>>>>>    1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    1)               | dma_generic_alloc_coherent() {
>>>>>>    1) ! 1868.372 us | dma_alloc_from_contiguous();
>>>>>>    1) ! 1869.586 us |                                  }
>>>>>>    1) ! 1870.053 us |                                }
>>>>>>    1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    1)               | dma_generic_alloc_coherent() {
>>>>>>    1) ! 1871.085 us | dma_alloc_from_contiguous();
>>>>>>    1) ! 1872.240 us |                                  }
>>>>>>    1) ! 1872.669 us |                                }
>>>>>>    1)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    1)               | dma_generic_alloc_coherent() {
>>>>>>    1) ! 1888.934 us | dma_alloc_from_contiguous();
>>>>>>    1) ! 1890.179 us |                                  }
>>>>>>    1) ! 1890.608 us |                                }
>>>>>>    1)   0.048 us    | ttm_set_pages_caching [ttm]();
>>>>>>    1) ! 7511.000 us |                              }
>>>>>>    1) ! 7511.306 us |                            }
>>>>>>    1) ! 7511.623 us |                          }
>>>>>>
>>>>>> The good case (with cma=0 kernel cmdline, so
>>>>>> dma_alloc_from_contiguous() no-ops,)
>>>>>>
>>>>>> 0)               |                          ttm_dma_pool_get_pages
>>>>>> [ttm]() {
>>>>>>    0)               | ttm_dma_page_pool_fill_locked [ttm]() {
>>>>>>    0)               | ttm_dma_pool_alloc_new_pages [ttm]() {
>>>>>>    0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    0)               | dma_generic_alloc_coherent() {
>>>>>>    0)   0.171 us    | dma_alloc_from_contiguous();
>>>>>>    0)   0.849 us    | __alloc_pages_nodemask();
>>>>>>    0)   3.029 us    |                                  }
>>>>>>    0)   3.882 us    |                                }
>>>>>>    0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    0)               | dma_generic_alloc_coherent() {
>>>>>>    0)   0.037 us    | dma_alloc_from_contiguous();
>>>>>>    0)   0.163 us    | __alloc_pages_nodemask();
>>>>>>    0)   1.408 us    |                                  }
>>>>>>    0)   1.719 us    |                                }
>>>>>>    0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    0)               | dma_generic_alloc_coherent() {
>>>>>>    0)   0.035 us    | dma_alloc_from_contiguous();
>>>>>>    0)   0.153 us    | __alloc_pages_nodemask();
>>>>>>    0)   1.454 us    |                                  }
>>>>>>    0)   1.720 us    |                                }
>>>>>>    0)               | __ttm_dma_alloc_page [ttm]() {
>>>>>>    0)               | dma_generic_alloc_coherent() {
>>>>>>    0)   0.036 us    | dma_alloc_from_contiguous();
>>>>>>    0)   0.112 us    | __alloc_pages_nodemask();
>>>>>>    0)   1.211 us    |                                  }
>>>>>>    0)   1.541 us    |                                }
>>>>>>    0)   0.035 us    | ttm_set_pages_caching [ttm]();
>>>>>>    0) + 10.902 us   |                              }
>>>>>>    0) + 11.577 us   |                            }
>>>>>>    0) + 11.988 us   |                          }
>>>>>>
>>>>>> _______________________________________________
>>>>>> dri-devel mailing list
>>>>>> dri-devel@lists.freedesktop.org
>>>>>> http://lists.freedesktop.org/mailman/listinfo/dri-devel
>>>>>>


^ permalink raw reply	[flat|nested] 30+ messages in thread


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-10 18:02             ` Mario Kleiner
@ 2014-08-11 10:11               ` Thomas Hellstrom
  -1 siblings, 0 replies; 30+ messages in thread
From: Thomas Hellstrom @ 2014-08-11 10:11 UTC (permalink / raw)
  To: Mario Kleiner, Konrad Rzeszutek Wilk, Dave Airlie, Jerome Glisse
  Cc: kamal, ben, LKML, dri-devel, m.szyprowski

On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>> Resent this time without HTML formatting which lkml doesn't like.
>>> Sorry.
>>>
>>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>>>> Hellstrom<thellstrom@vmware.com>  wrote:
>>>>>> Hi.
>>>>>>
>>>>> Hey Thomas!
>>>>>
>>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more
>>>>>> than
>>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>>>> the
>>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>>>> shed
>>>>>> some light over this?
>>>>> It should allocate in batches and keep them in the TTM DMA pool for
>>>>> some time to be reused.
>>>>>
>>>>> The pages that it gets are in 4kb granularity though.
>>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>>>> allocations shouldn't get routed to CMA.
>>>>
>>>> /Thomas
>>> Yes, seems you're both right. I read through the code a bit more and
>>> indeed the TTM DMA pool allocates only one page during each
>>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
>>> allocators don't check for single page CMA allocations and therefore
>>> try to get it from the CMA area anyway, instead of skipping to the
>>> much cheaper fallback.
>>>
>>> So the callers of dma_alloc_from_contiguous() could need that little
>>> optimization of skipping it if only one page is requested. For
>>>
>>> dma_generic_alloc_coherent
>>> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
>>>
>>> and intel_alloc_coherent
>>> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>
>>> this
>>> seems easy to do. Looking at the arm arch variants, e.g.,
>>>
>>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>>>
>>>
>>> and
>>>
>>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>>>
>>>
>>> i'm not sure if it is that easily done, as there aren't any fallbacks
>>> for such a case and the code looks to me as if that's at least
>>> somewhat intentional.
>>>
>>> As far as TTM goes, one quick one-line fix to prevent it from using
>>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
>>> above methods) would be to clear the __GFP_WAIT
>>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
>>> flag from the
>>> passed gfp_t flags. That would trigger the well working fallback.
>>> So, is
>>>
>>> __GFP_WAIT 
>>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
>>> needed
>>> for those single page allocations that go through __ttm_dma_alloc_page
>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
>>>
>>>
>>> It would be nice to have such a simple, non-intrusive one-line patch
>>> that we still could get into 3.17 and then backported to older stable
>>> kernels to avoid the same desktop hangs there if CMA is enabled. It
>>> would be also nice for actual users of CMA to not use up lots of CMA
>>> space for gpu's which don't need it. I think DMA_CMA was introduced
>>> around 3.12.
>>>
>> I don't think that's a good idea. Omitting __GFP_WAIT would cause
>> unnecessary memory allocation errors on systems under stress.
>> I think this should be filed as a DMA subsystem kernel bug / regression
>> and an appropriate solution should be worked out together with the DMA
>> subsystem maintainers and then backported.
>
> Ok, so it is needed. I'll file a bug report.
>
>>> The other problem is that probably TTM does not reuse pages from the
>>> DMA pool. If i trace the __ttm_dma_alloc_page
>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
>>> and
>>> __ttm_dma_free_page
>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
>>> calls for
>>> those single page allocs/frees, then over a 20 second interval of
>>> tracing and switching tabs in firefox, scrolling things around etc. i
>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>>> 1648 frees.
>> This is because historically the pools have been designed to keep only
>> pages with nonstandard caching attributes since changing page caching
>> attributes have been very slow but the kernel page allocators have been
>> reasonably fast.
>>
>> /Thomas
>
> Ok. A bit more ftraceing showed my hang problem case goes through the
> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> it bouncing up and down by 4 pages all the time.
>
> But for the non-cached case, which i don't hit with my problem, could
> one of you look at line 954...
>
> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>
>
> ... and tell me why that unconditional npages = count; assignment
> makes sense? It seems to essentially disable all recycling for the dma
> pool whenever the pool isn't filled up to/beyond its maximum with free
> pages? When the pool is filled up, lots of stuff is recycled, but when
> it is already somewhat below capacity, it gets "punished" by not
> getting refilled? I'd just like to understand the logic behind that line.
>
> thanks,
> -mario

I'll happily forward that question to Konrad who wrote the code (or it
may even stem from the ordinary page pool code which IIRC has Dave
Airlie / Jerome Glisse as authors)

/Thomas
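[Editor's note: the single-page skip discussed above can be modeled in plain userspace C. The allocator stubs and call counters below are invented purely for illustration; they are not kernel APIs, just a sketch of the control flow being proposed.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the real allocators (invented for this sketch):
 * cma_alloc() plays the slow dma_alloc_from_contiguous() path and
 * fast_alloc() plays the cheap alloc_pages_node() fallback. */
static int cma_calls, fast_calls;

static void *cma_alloc(size_t count)
{
	(void)count;
	cma_calls++;
	return (void *)1; /* pretend success */
}

static void *fast_alloc(size_t count)
{
	(void)count;
	fast_calls++;
	return (void *)1; /* pretend success */
}

/* The proposed optimization: only bother the CMA area for requests of
 * more than one page; single pages go straight to the page allocator. */
static void *alloc_coherent_sketch(size_t count)
{
	void *page = NULL;

	if (count > 1)
		page = cma_alloc(count);
	if (!page)
		page = fast_alloc(count);
	return page;
}
```

With this shape, TTM's one-page dma_alloc_coherent() calls would never touch CMA, while larger coherent allocations would keep their contiguous-memory path.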
 

^ permalink raw reply	[flat|nested] 30+ messages in thread


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-11 10:11               ` Thomas Hellstrom
@ 2014-08-11 15:17                 ` Jerome Glisse
  -1 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2014-08-11 15:17 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: Mario Kleiner, Konrad Rzeszutek Wilk, Dave Airlie, kamal, ben,
	LKML, dri-devel, m.szyprowski

[-- Attachment #1: Type: text/plain, Size: 8404 bytes --]

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> > On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>> Resent this time without HTML formatting which lkml doesn't like.
> >>> Sorry.
> >>>
> >>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> >>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> >>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
> >>>>> Hellstrom<thellstrom@vmware.com>  wrote:
> >>>>>> Hi.
> >>>>>>
> >>>>> Hey Thomas!
> >>>>>
> >>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more
> >>>>>> than
> >>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
> >>>>>> the
> >>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
> >>>>>> shed
> >>>>>> some light over this?
> >>>>> It should allocate in batches and keep them in the TTM DMA pool for
> >>>>> some time to be reused.
> >>>>>
> >>>>> The pages that it gets are in 4kb granularity though.
> >>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
> >>>> allocations shouldn't get routed to CMA.
> >>>>
> >>>> /Thomas
> >>> Yes, seems you're both right. I read through the code a bit more and
> >>> indeed the TTM DMA pool allocates only one page during each
> >>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >>> allocators don't check for single page CMA allocations and therefore
> >>> try to get it from the CMA area anyway, instead of skipping to the
> >>> much cheaper fallback.
> >>>
> >>> So the callers of dma_alloc_from_contiguous() could need that little
> >>> optimization of skipping it if only one page is requested. For
> >>>
> >>> dma_generic_alloc_coherent
> >>> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
> >>>
> >>> and intel_alloc_coherent
> >>> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>
> >>> this
> >>> seems easy to do. Looking at the arm arch variants, e.g.,
> >>>
> >>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
> >>>
> >>>
> >>> and
> >>>
> >>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
> >>>
> >>>
> >>> i'm not sure if it is that easily done, as there aren't any fallbacks
> >>> for such a case and the code looks to me as if that's at least
> >>> somewhat intentional.
> >>>
> >>> As far as TTM goes, one quick one-line fix to prevent it from using
> >>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>> above methods) would be to clear the __GFP_WAIT
> >>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>> flag from the
> >>> passed gfp_t flags. That would trigger the well working fallback.
> >>> So, is
> >>>
> >>> __GFP_WAIT 
> >>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>> needed
> >>> for those single page allocations that go through __ttm_dma_alloc_page
> >>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
> >>>
> >>>
> >>> It would be nice to have such a simple, non-intrusive one-line patch
> >>> that we still could get into 3.17 and then backported to older stable
> >>> kernels to avoid the same desktop hangs there if CMA is enabled. It
> >>> would be also nice for actual users of CMA to not use up lots of CMA
> >>> space for gpu's which don't need it. I think DMA_CMA was introduced
> >>> around 3.12.
> >>>
> >> I don't think that's a good idea. Omitting __GFP_WAIT would cause
> >> unnecessary memory allocation errors on systems under stress.
> >> I think this should be filed as a DMA subsystem kernel bug / regression
> >> and an appropriate solution should be worked out together with the DMA
> >> subsystem maintainers and then backported.
> >
> > Ok, so it is needed. I'll file a bug report.
> >
> >>> The other problem is that probably TTM does not reuse pages from the
> >>> DMA pool. If i trace the __ttm_dma_alloc_page
> >>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>> and
> >>> __ttm_dma_free_page
> >>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
> >>> calls for
> >>> those single page allocs/frees, then over a 20 second interval of
> >>> tracing and switching tabs in firefox, scrolling things around etc. i
> >>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>> 1648 frees.
> >> This is because historically the pools have been designed to keep only
> >> pages with nonstandard caching attributes since changing page caching
> >> attributes have been very slow but the kernel page allocators have been
> >> reasonably fast.
> >>
> >> /Thomas
> >
> > Ok. A bit more ftraceing showed my hang problem case goes through the
> > "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> > it bouncing up and down by 4 pages all the time.
> >
> > But for the non-cached case, which i don't hit with my problem, could
> > one of you look at line 954...
> >
> > http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >
> >
> > ... and tell me why that unconditional npages = count; assignment
> > makes sense? It seems to essentially disable all recycling for the dma
> > pool whenever the pool isn't filled up to/beyond its maximum with free
> > pages? When the pool is filled up, lots of stuff is recycled, but when
> > it is already somewhat below capacity, it gets "punished" by not
> > getting refilled? I'd just like to understand the logic behind that line.
> >
> > thanks,
> > -mario
> 
> I'll happily forward that question to Konrad who wrote the code (or it
> may even stem from the ordinary page pool code which IIRC has Dave
> Airlie / Jerome Glisse as authors)

This is effectively bogus code; I now wonder how it survived this long.
The attached patch fixes it.


> 
> /Thomas
>  

[-- Attachment #2: 0001-drm-ttm-fix-object-deallocation-to-properly-fill-in-.patch --]
[-- Type: text/plain, Size: 1581 bytes --]

>From f65e796fea5f79e4834f4609147ea06c123d6396 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Mon, 11 Aug 2014 11:10:31 -0400
Subject: [PATCH] drm/ttm: fix object deallocation to properly fill in the page
 pool.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Current code never allowed the page pool to actually fill up. This fixes it
and also allows the pool to grow past its limit by up to one allocation
batch before the excess is freed back down to the maximum.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f..73744cd 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -951,14 +951,9 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 	} else {
 		pool->npages_free += count;
 		list_splice(&ttm_dma->pages_list, &pool->free_list);
-		npages = count;
-		if (pool->npages_free > _manager->options.max_size) {
+		if (pool->npages_free >= (_manager->options.max_size +
+					  NUM_PAGES_TO_ALLOC))
 			npages = pool->npages_free - _manager->options.max_size;
-			/* free at least NUM_PAGES_TO_ALLOC number of pages
-			 * to reduce calls to set_memory_wb */
-			if (npages < NUM_PAGES_TO_ALLOC)
-				npages = NUM_PAGES_TO_ALLOC;
-		}
 	}
 	spin_unlock_irqrestore(&pool->lock, irq_flags);
 
-- 
1.9.3
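[Editor's note: to make the behavioral change concrete, here is a small userspace model of the two trimming rules in ttm_dma_unpopulate(). The constants are assumed example values standing in for NUM_PAGES_TO_ALLOC and _manager->options.max_size, not the driver's actual configuration.]

```c
#include <assert.h>

/* Assumed example values, not the driver's real configuration. */
#define BATCH     64	/* stands in for NUM_PAGES_TO_ALLOC */
#define POOL_MAX 256	/* stands in for _manager->options.max_size */

/* Old rule: npages starts at count, so at least the pages just returned
 * are immediately freed again -- the free list can never accumulate. */
static int old_pages_to_free(int npages_free, int count)
{
	int npages = count;

	if (npages_free > POOL_MAX) {
		npages = npages_free - POOL_MAX;
		/* free at least a batch to reduce set_memory_wb() calls */
		if (npages < BATCH)
			npages = BATCH;
	}
	return npages;
}

/* Patched rule: nothing is freed until the pool overshoots its maximum
 * by a full batch; only then is it trimmed back down to POOL_MAX. */
static int new_pages_to_free(int npages_free, int count)
{
	(void)count;
	if (npages_free >= POOL_MAX + BATCH)
		return npages_free - POOL_MAX;
	return 0;
}
```

With 100 pages on the free list and 4 just returned, the old rule frees those 4 right away, while the new rule keeps them pooled for reuse.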


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
@ 2014-08-11 15:17                 ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2014-08-11 15:17 UTC (permalink / raw)
  To: Thomas Hellstrom
  Cc: Konrad Rzeszutek Wilk, kamal, LKML, dri-devel, Dave Airlie, ben,
	m.szyprowski

[-- Attachment #1: Type: text/plain, Size: 8404 bytes --]

On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> > On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>> Resent this time without HTML formatting which lkml doesn't like.
> >>> Sorry.
> >>>
> >>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> >>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> >>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
> >>>>> Hellstrom<thellstrom@vmware.com>  wrote:
> >>>>>> Hi.
> >>>>>>
> >>>>> Hey Thomas!
> >>>>>
> >>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more
> >>>>>> than
> >>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
> >>>>>> the
> >>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
> >>>>>> shed
> >>>>>> some light over this?
> >>>>> It should allocate in batches and keep them in the TTM DMA pool for
> >>>>> some time to be reused.
> >>>>>
> >>>>> The pages that it gets are in 4kb granularity though.
> >>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
> >>>> allocations shouldn't get routed to CMA.
> >>>>
> >>>> /Thomas
> >>> Yes, seems you're both right. I read through the code a bit more and
> >>> indeed the TTM DMA pool allocates only one page during each
> >>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >>> allocators don't check for single page CMA allocations and therefore
> >>> try to get it from the CMA area anyway, instead of skipping to the
> >>> much cheaper fallback.
> >>>
> >>> So the callers of dma_alloc_from_contiguous() could need that little
> >>> optimization of skipping it if only one page is requested. For
> >>>
> >>> dma_generic_alloc_coherent
> >>> <https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Ddma_generic_alloc_coherent&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=d1852625e2ab2ff07eb34a7f33fc1f55f7f13959912d5a6ce9316d23070ce939>
> >>>
> >>> andintel_alloc_coherent
> >>> <https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3Dintel_alloc_coherent&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=82d587e9b6aeced5cf9a7caefa91bf47fba809f3522b7379d22e45a2d5d35ebd> 
> >>> this
> >>> seems easy to do. Looking at the arm arch variants, e.g.,
> >>>
> >>> https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c%23L1194&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=4c178257eab9b5d7ca650dedba76cf27abeb49ddc7aebb9433f52b6c8bb3bbac
> >>>
> >>>
> >>> and
> >>>
> >>> https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c%23L44&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=5f62f4cbe8cee1f1dd4cbba656354efe6867bcdc664cf90e9719e2f42a85de08
> >>>
> >>>
> >>> i'm not sure if it is that easily done, as there aren't any fallbacks
> >>> for such a case and the code looks to me as if that's at least
> >>> somewhat intentional.
> >>>
> >>> As far as TTM goes, one quick one-line fix to prevent it from using
> >>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>> above methods) would be to clear the __GFP_WAIT
> >>> <https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAIT&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc>
> >>> flag from the
> >>> passed gfp_t flags. That would trigger the well working fallback.
> >>> So, is
> >>>
> >>> __GFP_WAIT 
> >>> <https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__GFP_WAIT&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=d56d076770d3416264be6c9ea2829ac0d6951203696fa3ad04144f13307577bc> 
> >>> needed
> >>> for those single page allocations that go through__ttm_dma_alloc_page
> >>> <https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_page&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0>?
> >>>
> >>>
> >>> It would be nice to have such a simple, non-intrusive one-line patch
> >>> that we still could get into 3.17 and then backported to older stable
> >>> kernels to avoid the same desktop hangs there if CMA is enabled. It
> >>> would be also nice for actual users of CMA to not use up lots of CMA
> >>> space for gpu's which don't need it. I think DMA_CMA was introduced
> >>> around 3.12.
> >>>
> >> I don't think that's a good idea. Omitting __GFP_WAIT would cause
> >> unnecessary memory allocation errors on systems under stress.
> >> I think this should be filed as a DMA subsystem kernel bug / regression
> >> and an appropriate solution should be worked out together with the DMA
> >> subsystem maintainers and then backported.
> >
> > Ok, so it is needed. I'll file a bug report.
> >
> >>> The other problem is that probably TTM does not reuse pages from the
> >>> DMA pool. If i trace the __ttm_dma_alloc_page
> >>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>> and
> >>> __ttm_dma_free_page
> >>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>> calls for
> >>> those single page allocs/frees, then over a 20 second interval of
> >>> tracing and switching tabs in firefox, scrolling things around etc. i
> >>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>> 1648 frees.
> >> This is because historically the pools have been designed to keep only
> >> pages with nonstandard caching attributes since changing page caching
> >> attributes have been very slow but the kernel page allocators have been
> >> reasonably fast.
> >>
> >> /Thomas
> >
> > Ok. A bit more ftraceing showed my hang problem case goes through the
> > "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> > it bouncing up and down by 4 pages all the time.
> >
> > But for the non-cached case, which i don't hit with my problem, could
> > one of you look at line 954...
> >
> > http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >
> >
> > ... and tell me why that unconditional npages = count; assignment
> > makes sense? It seems to essentially disable all recycling for the dma
> > pool whenever the pool isn't filled up to/beyond its maximum with free
> > pages? When the pool is filled up, lots of stuff is recycled, but when
> > it is already somewhat below capacity, it gets "punished" by not
> > getting refilled? I'd just like to understand the logic behind that line.
> >
> > thanks,
> > -mario
> 
> I'll happily forward that question to Konrad who wrote the code (or it
> may even stem from the ordinary page pool code which IIRC has Dave
> Airlie / Jerome Glisse as authors)

This is effectively bogus code, I now wonder how it came to stay alive.
Attached patch will fix that.


> 
> /Thomas
>  

[-- Attachment #2: 0001-drm-ttm-fix-object-deallocation-to-properly-fill-in-.patch --]
[-- Type: text/plain, Size: 1623 bytes --]

>From f65e796fea5f79e4834f4609147ea06c123d6396 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?J=C3=A9r=C3=B4me=20Glisse?= <jglisse@redhat.com>
Date: Mon, 11 Aug 2014 11:10:31 -0400
Subject: [PATCH] drm/ttm: fix object deallocation to properly fill in the page
 pool.
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Current code never allowed the page pool to actually fill up. This fixes
it and also allows the pool to grow over its limit until the excess grows
beyond the batch size for allocation and deallocation.

Signed-off-by: Jérôme Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f..73744cd 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -951,14 +951,9 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 	} else {
 		pool->npages_free += count;
 		list_splice(&ttm_dma->pages_list, &pool->free_list);
-		npages = count;
-		if (pool->npages_free > _manager->options.max_size) {
+		if (pool->npages_free >= (_manager->options.max_size +
+					  NUM_PAGES_TO_ALLOC))
 			npages = pool->npages_free - _manager->options.max_size;
-			/* free at least NUM_PAGES_TO_ALLOC number of pages
-			 * to reduce calls to set_memory_wb */
-			if (npages < NUM_PAGES_TO_ALLOC)
-				npages = NUM_PAGES_TO_ALLOC;
-		}
 	}
 	spin_unlock_irqrestore(&pool->lock, irq_flags);
 
-- 
1.9.3


[-- Attachment #3: Type: text/plain, Size: 159 bytes --]

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-11 15:17                 ` Jerome Glisse
@ 2014-08-12 12:12                   ` Mario Kleiner
  -1 siblings, 0 replies; 30+ messages in thread
From: Mario Kleiner @ 2014-08-12 12:12 UTC (permalink / raw)
  To: Jerome Glisse, Thomas Hellstrom
  Cc: Konrad Rzeszutek Wilk, Dave Airlie, kamal, ben, LKML, dri-devel,
	m.szyprowski

On 08/11/2014 05:17 PM, Jerome Glisse wrote:
> On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
>> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
>>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>>>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>>>> Resent this time without HTML formatting which lkml doesn't like.
>>>>> Sorry.
>>>>>
>>>>> On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
>>>>>> On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
>>>>>>> On August 9, 2014 1:39:39 AM EDT, Thomas
>>>>>>> Hellstrom<thellstrom@vmware.com>  wrote:
>>>>>>>> Hi.
>>>>>>>>
>>>>>>> Hey Thomas!
>>>>>>>
>>>>>>>> IIRC I don't think the TTM DMA pool allocates coherent pages more
>>>>>>>> than
>>>>>>>> one page at a time, and _if that's true_ it's pretty unnecessary for
>>>>>>>> the
>>>>>>>> dma subsystem to route those allocations to CMA. Maybe Konrad could
>>>>>>>> shed
>>>>>>>> some light over this?
>>>>>>> It should allocate in batches and keep them in the TTM DMA pool for
>>>>>>> some time to be reused.
>>>>>>>
>>>>>>> The pages that it gets are in 4kb granularity though.
>>>>>> Then I feel inclined to say this is a DMA subsystem bug. Single page
>>>>>> allocations shouldn't get routed to CMA.
>>>>>>
>>>>>> /Thomas
>>>>> Yes, seems you're both right. I read through the code a bit more and
>>>>> indeed the TTM DMA pool allocates only one page during each
>>>>> dma_alloc_coherent() call, so it doesn't need CMA memory. The current
>>>>> allocators don't check for single page CMA allocations and therefore
>>>>> try to get it from the CMA area anyway, instead of skipping to the
>>>>> much cheaper fallback.
>>>>>
>>>>> So the callers of dma_alloc_from_contiguous() could need that little
>>>>> optimization of skipping it if only one page is requested. For
>>>>>
>>>>> dma_generic_alloc_coherent
>>>>> <http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
>>>>>
>>>>> and intel_alloc_coherent
>>>>> <http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>
>>>>> this
>>>>> seems easy to do. Looking at the arm arch variants, e.g.,
>>>>>
>>>>> http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
>>>>>
>>>>>
>>>>> and
>>>>>
>>>>> http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
>>>>>
>>>>>
>>>>> i'm not sure if it is that easily done, as there aren't any fallbacks
>>>>> for such a case and the code looks to me as if that's at least
>>>>> somewhat intentional.
>>>>>
>>>>> As far as TTM goes, one quick one-line fix to prevent it from using
>>>>> the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
>>>>> above methods) would be to clear the __GFP_WAIT
>>>>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
>>>>> flag from the
>>>>> passed gfp_t flags. That would trigger the well working fallback.
>>>>> So, is
>>>>>
>>>>> __GFP_WAIT
>>>>> <http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
>>>>> needed
>>>>> for those single page allocations that go through __ttm_dma_alloc_page
>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
>>>>>
>>>>>
>>>>> It would be nice to have such a simple, non-intrusive one-line patch
>>>>> that we still could get into 3.17 and then backported to older stable
>>>>> kernels to avoid the same desktop hangs there if CMA is enabled. It
>>>>> would be also nice for actual users of CMA to not use up lots of CMA
>>>>> space for gpu's which don't need it. I think DMA_CMA was introduced
>>>>> around 3.12.
>>>>>
>>>> I don't think that's a good idea. Omitting __GFP_WAIT would cause
>>>> unnecessary memory allocation errors on systems under stress.
>>>> I think this should be filed as a DMA subsystem kernel bug / regression
>>>> and an appropriate solution should be worked out together with the DMA
>>>> subsystem maintainers and then backported.
>>> Ok, so it is needed. I'll file a bug report.
>>>
>>>>> The other problem is that probably TTM does not reuse pages from the
>>>>> DMA pool. If i trace the __ttm_dma_alloc_page
>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
>>>>> and
>>>>> __ttm_dma_free_page
>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
>>>>> calls for
>>>>> those single page allocs/frees, then over a 20 second interval of
>>>>> tracing and switching tabs in firefox, scrolling things around etc. i
>>>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>>>>> 1648 frees.
>>>> This is because historically the pools have been designed to keep only
>>>> pages with nonstandard caching attributes since changing page caching
>>>> attributes have been very slow but the kernel page allocators have been
>>>> reasonably fast.
>>>>
>>>> /Thomas
>>> Ok. A bit more ftraceing showed my hang problem case goes through the
>>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
>>> it bouncing up and down by 4 pages all the time.
>>>
>>> But for the non-cached case, which i don't hit with my problem, could
>>> one of you look at line 954...
>>>
>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>>>
>>>
>>> ... and tell me why that unconditional npages = count; assignment
>>> makes sense? It seems to essentially disable all recycling for the dma
>>> pool whenever the pool isn't filled up to/beyond its maximum with free
>>> pages? When the pool is filled up, lots of stuff is recycled, but when
>>> it is already somewhat below capacity, it gets "punished" by not
>>> getting refilled? I'd just like to understand the logic behind that line.
>>>
>>> thanks,
>>> -mario
>> I'll happily forward that question to Konrad who wrote the code (or it
>> may even stem from the ordinary page pool code which IIRC has Dave
>> Airlie / Jerome Glisse as authors)
> This is effectively bogus code, i now wonder how it came to stay alive.
> Attached patch will fix that.

Yes, that makes sense to me. Fwiw,

Reviewed-by: Mario Kleiner <mario.kleiner.de@gmail.com>

-mario


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-12 12:12                   ` Mario Kleiner
@ 2014-08-12 20:47                     ` Konrad Rzeszutek Wilk
  -1 siblings, 0 replies; 30+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-12 20:47 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Jerome Glisse, Thomas Hellstrom, Dave Airlie, kamal, ben, LKML,
	dri-devel, m.szyprowski

On Tue, Aug 12, 2014 at 02:12:07PM +0200, Mario Kleiner wrote:
> On 08/11/2014 05:17 PM, Jerome Glisse wrote:
> >On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>>On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >>>>On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>>>>Resent this time without HTML formatting which lkml doesn't like.
> >>>>>Sorry.
> >>>>>
> >>>>>On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> >>>>>>On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> >>>>>>>On August 9, 2014 1:39:39 AM EDT, Thomas
> >>>>>>>Hellstrom<thellstrom@vmware.com>  wrote:
> >>>>>>>>Hi.
> >>>>>>>>
> >>>>>>>Hey Thomas!
> >>>>>>>
> >>>>>>>>IIRC I don't think the TTM DMA pool allocates coherent pages more
> >>>>>>>>than
> >>>>>>>>one page at a time, and _if that's true_ it's pretty unnecessary for
> >>>>>>>>the
> >>>>>>>>dma subsystem to route those allocations to CMA. Maybe Konrad could
> >>>>>>>>shed
> >>>>>>>>some light over this?
> >>>>>>>It should allocate in batches and keep them in the TTM DMA pool for
> >>>>>>>some time to be reused.
> >>>>>>>
> >>>>>>>The pages that it gets are in 4kb granularity though.
> >>>>>>Then I feel inclined to say this is a DMA subsystem bug. Single page
> >>>>>>allocations shouldn't get routed to CMA.
> >>>>>>
> >>>>>>/Thomas
> >>>>>Yes, seems you're both right. I read through the code a bit more and
> >>>>>indeed the TTM DMA pool allocates only one page during each
> >>>>>dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >>>>>allocators don't check for single page CMA allocations and therefore
> >>>>>try to get it from the CMA area anyway, instead of skipping to the
> >>>>>much cheaper fallback.
> >>>>>
> >>>>>So the callers of dma_alloc_from_contiguous() could need that little
> >>>>>optimization of skipping it if only one page is requested. For
> >>>>>
> >>>>>dma_generic_alloc_coherent
> >>>>><http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
> >>>>>
> >>>>>and intel_alloc_coherent
> >>>>><http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>
> >>>>>this
> >>>>>seems easy to do. Looking at the arm arch variants, e.g.,
> >>>>>
> >>>>>http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
> >>>>>
> >>>>>
> >>>>>and
> >>>>>
> >>>>>http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
> >>>>>
> >>>>>
> >>>>>i'm not sure if it is that easily done, as there aren't any fallbacks
> >>>>>for such a case and the code looks to me as if that's at least
> >>>>>somewhat intentional.
> >>>>>
> >>>>>As far as TTM goes, one quick one-line fix to prevent it from using
> >>>>>the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>>>>above methods) would be to clear the __GFP_WAIT
> >>>>><http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>>>>flag from the
> >>>>>passed gfp_t flags. That would trigger the well working fallback.
> >>>>>So, is
> >>>>>
> >>>>>__GFP_WAIT
> >>>>><http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>>>>needed
> >>>>>for those single page allocations that go through __ttm_dma_alloc_page
> >>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>?
> >>>>>
> >>>>>
> >>>>>It would be nice to have such a simple, non-intrusive one-line patch
> >>>>>that we still could get into 3.17 and then backported to older stable
> >>>>>kernels to avoid the same desktop hangs there if CMA is enabled. It
> >>>>>would be also nice for actual users of CMA to not use up lots of CMA
> >>>>>space for gpu's which don't need it. I think DMA_CMA was introduced
> >>>>>around 3.12.
> >>>>>
> >>>>I don't think that's a good idea. Omitting __GFP_WAIT would cause
> >>>>unnecessary memory allocation errors on systems under stress.
> >>>>I think this should be filed as a DMA subsystem kernel bug / regression
> >>>>and an appropriate solution should be worked out together with the DMA
> >>>>subsystem maintainers and then backported.
> >>>Ok, so it is needed. I'll file a bug report.
> >>>
> >>>>>The other problem is that probably TTM does not reuse pages from the
> >>>>>DMA pool. If i trace the __ttm_dma_alloc_page
> >>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>>and
> >>>>>__ttm_dma_free_page
> >>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>>calls for
> >>>>>those single page allocs/frees, then over a 20 second interval of
> >>>>>tracing and switching tabs in firefox, scrolling things around etc. i
> >>>>>find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>>>>1648 frees.
> >>>>This is because historically the pools have been designed to keep only
> >>>>pages with nonstandard caching attributes since changing page caching
> >>>>attributes have been very slow but the kernel page allocators have been
> >>>>reasonably fast.
> >>>>
> >>>>/Thomas
> >>>Ok. A bit more ftraceing showed my hang problem case goes through the
> >>>"if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>>it bouncing up and down by 4 pages all the time.
> >>>
> >>>But for the non-cached case, which i don't hit with my problem, could
> >>>one of you look at line 954...
> >>>
> >>>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>
> >>>
> >>>... and tell me why that unconditional npages = count; assignment
> >>>makes sense? It seems to essentially disable all recycling for the dma
> >>>pool whenever the pool isn't filled up to/beyond its maximum with free
> >>>pages? When the pool is filled up, lots of stuff is recycled, but when
> >>>it is already somewhat below capacity, it gets "punished" by not
> >>>getting refilled? I'd just like to understand the logic behind that line.
> >>>
> >>>thanks,
> >>>-mario
> >>I'll happily forward that question to Konrad who wrote the code (or it
> >>may even stem from the ordinary page pool code which IIRC has Dave
> >>Airlie / Jerome Glisse as authors)
> >This is effectively bogus code, i now wonder how it came to stay alive.
> >Attached patch will fix that.
> 
> Yes, that makes sense to me. Fwiw,
> 
> Reviewed-by: Mario Kleiner <mario.kleiner.de@gmail.com>

What about testing? Did it make the issue less of a problem or did it
disappear completely?

Thank you.
> 
> -mario
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
@ 2014-08-12 20:47                     ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 30+ messages in thread
From: Konrad Rzeszutek Wilk @ 2014-08-12 20:47 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Thomas Hellstrom, kamal, LKML, dri-devel, Dave Airlie, ben, m.szyprowski

On Tue, Aug 12, 2014 at 02:12:07PM +0200, Mario Kleiner wrote:
> On 08/11/2014 05:17 PM, Jerome Glisse wrote:
> >On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>>On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >>>>On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>>>>Resent this time without HTML formatting which lkml doesn't like.
> >>>>>Sorry.
> >>>>>
> >>>>>On 08/09/2014 03:58 PM, Thomas Hellstrom wrote:
> >>>>>>On 08/09/2014 03:33 PM, Konrad Rzeszutek Wilk wrote:
> >>>>>>>On August 9, 2014 1:39:39 AM EDT, Thomas
> >>>>>>>Hellstrom<thellstrom@vmware.com>  wrote:
> >>>>>>>>Hi.
> >>>>>>>>
> >>>>>>>Hey Thomas!
> >>>>>>>
> >>>>>>>>IIRC I don't think the TTM DMA pool allocates coherent pages more
> >>>>>>>>than
> >>>>>>>>one page at a time, and _if that's true_ it's pretty unnecessary for
> >>>>>>>>the
> >>>>>>>>dma subsystem to route those allocations to CMA. Maybe Konrad could
> >>>>>>>>shed
> >>>>>>>>some light over this?
> >>>>>>>It should allocate in batches and keep them in the TTM DMA pool for
> >>>>>>>some time to be reused.
> >>>>>>>
> >>>>>>>The pages that it gets are in 4kb granularity though.
> >>>>>>Then I feel inclined to say this is a DMA subsystem bug. Single page
> >>>>>>allocations shouldn't get routed to CMA.
> >>>>>>
> >>>>>>/Thomas
> >>>>>Yes, seems you're both right. I read through the code a bit more and
> >>>>>indeed the TTM DMA pool allocates only one page during each
> >>>>>dma_alloc_coherent() call, so it doesn't need CMA memory. The current
> >>>>>allocators don't check for single page CMA allocations and therefore
> >>>>>try to get it from the CMA area anyway, instead of skipping to the
> >>>>>much cheaper fallback.
> >>>>>
> >>>>>So the callers of dma_alloc_from_contiguous() could need that little
> >>>>>optimization of skipping it if only one page is requested. For
> >>>>>
> >>>>>dma_generic_alloc_coherent
> >>>>><http://lxr.free-electrons.com/ident?i=dma_generic_alloc_coherent>
> >>>>>
> >>>>>and intel_alloc_coherent
> >>>>><http://lxr.free-electrons.com/ident?i=intel_alloc_coherent>
> >>>>>this
> >>>>>seems easy to do. Looking at the arm arch variants, e.g.,
> >>>>>
> >>>>>http://lxr.free-electrons.com/source/arch/arm/mm/dma-mapping.c#L1194
> >>>>>
> >>>>>
> >>>>>and
> >>>>>
> >>>>>http://lxr.free-electrons.com/source/arch/arm64/mm/dma-mapping.c#L44
> >>>>>
> >>>>>
> >>>>>i'm not sure if it is that easily done, as there aren't any fallbacks
> >>>>>for such a case and the code looks to me as if that's at least
> >>>>>somewhat intentional.
> >>>>>
> >>>>>As far as TTM goes, one quick one-line fix to prevent it from using
> >>>>>the CMA at least on SWIOTLB, NOMMU and Intel IOMMU (when using the
> >>>>>above methods) would be to clear the __GFP_WAIT
> >>>>><http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>>>>flag from the
> >>>>>passed gfp_t flags. That would trigger the well working fallback.
> >>>>>So, is
> >>>>>
> >>>>>__GFP_WAIT
> >>>>><http://lxr.free-electrons.com/ident?i=__GFP_WAIT>
> >>>>>needed
> >>>>>for those single page allocations that go through__ttm_dma_alloc_page
> >>>>><https://urldefense.proofpoint.com/v1/url?u=http://lxr.free-electrons.com/ident?i%3D__ttm_dma_alloc_page&k=oIvRg1%2BdGAgOoM1BIlLLqw%3D%3D%0A&r=l5Ago9ekmVFZ3c4M6eauqrJWGwjf6fTb%2BP3CxbBFkVM%3D%0A&m=QQSN6uVpEiw6RuWLAfK%2FKWBFV5HspJUfDh4Y2mUz%2FH4%3D%0A&s=7898522bba274e4dcc332735fbcf0c96e48918f60c2ee8e9a3e9c73ab3487bd0>?
> >>>>>
> >>>>>
> >>>>>It would be nice to have such a simple, non-intrusive one-line patch
> >>>>>that we still could get into 3.17 and then backported to older stable
> >>>>>kernels to avoid the same desktop hangs there if CMA is enabled. It
> >>>>>would be also nice for actual users of CMA to not use up lots of CMA
> >>>>>space for gpu's which don't need it. I think DMA_CMA was introduced
> >>>>>around 3.12.
> >>>>>
> >>>>I don't think that's a good idea. Omitting __GFP_WAIT would cause
> >>>>unnecessary memory allocation errors on systems under stress.
> >>>>I think this should be filed as a DMA subsystem kernel bug / regression
> >>>>and an appropriate solution should be worked out together with the DMA
> >>>>subsystem maintainers and then backported.
> >>>Ok, so it is needed. I'll file a bug report.
> >>>
> >>>>>The other problem is that probably TTM does not reuse pages from the
> >>>>>DMA pool. If i trace the __ttm_dma_alloc_page
> >>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>>and
> >>>>>__ttm_dma_free_page
> >>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
> >>>>>calls for
> >>>>>those single page allocs/frees, then over a 20 second interval of
> >>>>>tracing and switching tabs in firefox, scrolling things around etc. i
> >>>>>find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>>>>1648 frees.
> >>>>This is because historically the pools have been designed to keep only
> >>>>pages with nonstandard caching attributes since changing page caching
> >>>>attributes have been very slow but the kernel page allocators have been
> >>>>reasonably fast.
> >>>>
> >>>>/Thomas
> >>>Ok. A bit more ftraceing showed my hang problem case goes through the
> >>>"if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>>it bouncing up and down by 4 pages all the time.
> >>>
> >>>But for the non-cached case, which i don't hit with my problem, could
> >>>one of you look at line 954...
> >>>
> >>>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>
> >>>
> >>>... and tell me why that unconditional npages = count; assignment
> >>>makes sense? It seems to essentially disable all recycling for the dma
> >>>pool whenever the pool isn't filled up to/beyond its maximum with free
> >>>pages? When the pool is filled up, lots of stuff is recycled, but when
> >>>it is already somewhat below capacity, it gets "punished" by not
> >>>getting refilled? I'd just like to understand the logic behind that line.
> >>>
> >>>thanks,
> >>>-mario
> >>I'll happily forward that question to Konrad who wrote the code (or it
> >>may even stem from the ordinary page pool code which IIRC has Dave
> >>Airlie / Jerome Glisse as authors)
> >This is effectively bogus code, i now wonder how it came to stay alive.
> >Attached patch will fix that.
> 
> Yes, that makes sense to me. Fwiw,
> 
> Reviewed-by: Mario Kleiner <mario.kleiner.de@gmail.com>

What about testing? Did it make the issue less of a problem or did it
disappear completely?

Thank you.
> 
> -mario
> 


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-11 15:17                 ` Jerome Glisse
  (?)
  (?)
@ 2014-08-13  1:50                 ` Michel Dänzer
  2014-08-13  2:04                   ` Mario Kleiner
  2014-08-13  2:04                   ` Jerome Glisse
  -1 siblings, 2 replies; 30+ messages in thread
From: Michel Dänzer @ 2014-08-13  1:50 UTC (permalink / raw)
  To: Jerome Glisse, Thomas Hellstrom
  Cc: Konrad Rzeszutek Wilk, kamal, LKML, dri-devel, Dave Airlie, ben,
	m.szyprowski

On 12.08.2014 00:17, Jerome Glisse wrote:
> On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
>> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
>>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>>>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>>>>
>>>>> The other problem is that probably TTM does not reuse pages from the
>>>>> DMA pool. If i trace the __ttm_dma_alloc_page
>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
>>>>> and
>>>>> __ttm_dma_free_page
>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
>>>>> calls for
>>>>> those single page allocs/frees, then over a 20 second interval of
>>>>> tracing and switching tabs in firefox, scrolling things around etc. i
>>>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>>>>> 1648 frees.
>>>> This is because historically the pools have been designed to keep only
>>>> pages with nonstandard caching attributes since changing page caching
>>>> attributes have been very slow but the kernel page allocators have been
>>>> reasonably fast.
>>>>
>>>> /Thomas
>>>
>>> Ok. A bit more ftraceing showed my hang problem case goes through the
>>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
>>> it bouncing up and down by 4 pages all the time.
>>>
>>> But for the non-cached case, which i don't hit with my problem, could
>>> one of you look at line 954...
>>>
>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>>>
>>>
>>> ... and tell me why that unconditional npages = count; assignment
>>> makes sense? It seems to essentially disable all recycling for the dma
>>> pool whenever the pool isn't filled up to/beyond its maximum with free
>>> pages? When the pool is filled up, lots of stuff is recycled, but when
>>> it is already somewhat below capacity, it gets "punished" by not
>>> getting refilled? I'd just like to understand the logic behind that line.
>>>
>>> thanks,
>>> -mario
>>
>> I'll happily forward that question to Konrad who wrote the code (or it
>> may even stem from the ordinary page pool code which IIRC has Dave
>> Airlie / Jerome Glisse as authors)
> 
> This is effectively bogus code, i now wonder how it came to stay alive.
> Attached patch will fix that.

I haven't tested Mario's scenario specifically, but it survived piglit
and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
some BOs ended up in GTT instead with write-combined CPU mappings) on
radeonsi without any noticeable issues.

Tested-by: Michel Dänzer <michel.daenzer@amd.com>


-- 
Earthling Michel Dänzer            |                  http://www.amd.com
Libre software enthusiast          |                Mesa and X developer


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-13  1:50                 ` Michel Dänzer
@ 2014-08-13  2:04                   ` Mario Kleiner
  2014-08-13  2:17                       ` Jerome Glisse
  2014-08-13  2:04                   ` Jerome Glisse
  1 sibling, 1 reply; 30+ messages in thread
From: Mario Kleiner @ 2014-08-13  2:04 UTC (permalink / raw)
  To: Michel Dänzer, Jerome Glisse, Thomas Hellstrom
  Cc: Konrad Rzeszutek Wilk, kamal, LKML, dri-devel, Dave Airlie, ben,
	m.szyprowski

On 08/13/2014 03:50 AM, Michel Dänzer wrote:
> On 12.08.2014 00:17, Jerome Glisse wrote:
>> On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
>>> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
>>>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
>>>>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
>>>>>> The other problem is that probably TTM does not reuse pages from the
>>>>>> DMA pool. If i trace the __ttm_dma_alloc_page
>>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
>>>>>> and
>>>>>> __ttm_dma_free_page
>>>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
>>>>>> calls for
>>>>>> those single page allocs/frees, then over a 20 second interval of
>>>>>> tracing and switching tabs in firefox, scrolling things around etc. i
>>>>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
>>>>>> 1648 frees.
>>>>> This is because historically the pools have been designed to keep only
>>>>> pages with nonstandard caching attributes since changing page caching
>>>>> attributes have been very slow but the kernel page allocators have been
>>>>> reasonably fast.
>>>>>
>>>>> /Thomas
>>>> Ok. A bit more ftraceing showed my hang problem case goes through the
>>>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
>>>> it bouncing up and down by 4 pages all the time.
>>>>
>>>> But for the non-cached case, which i don't hit with my problem, could
>>>> one of you look at line 954...
>>>>
>>>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
>>>>
>>>>
>>>> ... and tell me why that unconditional npages = count; assignment
>>>> makes sense? It seems to essentially disable all recycling for the dma
>>>> pool whenever the pool isn't filled up to/beyond its maximum with free
>>>> pages? When the pool is filled up, lots of stuff is recycled, but when
>>>> it is already somewhat below capacity, it gets "punished" by not
>>>> getting refilled? I'd just like to understand the logic behind that line.
>>>>
>>>> thanks,
>>>> -mario
>>> I'll happily forward that question to Konrad who wrote the code (or it
>>> may even stem from the ordinary page pool code which IIRC has Dave
>>> Airlie / Jerome Glisse as authors)
>> This is effectively bogus code, i now wonder how it came to stay alive.
>> Attached patch will fix that.
> I haven't tested Mario's scenario specifically, but it survived piglit
> and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> some BOs ended up in GTT instead with write-combined CPU mappings) on
> radeonsi without any noticeable issues.
>
> Tested-by: Michel Dänzer <michel.daenzer@amd.com>
>
>

I haven't tested the patch yet. For the original bug it won't help 
directly, because the super-slow allocations which cause the desktop 
stall are tt_cached allocations, so they go through the if (is_cached) 
code path which isn't improved by Jerome's patch. is_cached always 
releases memory immediately, so the tt_cached pool just bounces up and 
down between 4 and 7 pages. So this was an independent issue. The slow 
allocations i noticed were mostly caused by exa allocating new gem bo's; 
i don't know which path is taken by 3d graphics.

However, the fixed ttm path could indirectly solve the DMA_CMA stalls by 
completely killing CMA for its intended purpose. Typical CMA sizes are 
probably below 100 MB (the kernel default is 16 MB, the Ubuntu config is 
64 MB), whereas the limit for the page pool seems to be more like 50% of 
all system RAM? IOW, if the ttm dma pool is allowed to grow that big with 
recycled pages, it will probably almost completely monopolize the whole 
CMA memory after a short amount of time. ttm won't suffer stalls if it 
essentially doesn't interact with CMA anymore after a warmup period, but 
actual clients which really need CMA (i.e., hardware without 
scatter-gather dma etc.) will be starved of what they need, as far as my 
limited understanding of CMA goes.

So fwiw probably the fix to ttm will increase the urgency for the CMA 
people to come up with a fix/optimization for the allocator. Unless it 
doesn't matter if most desktop systems have CMA disabled by default, and 
ttm is mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? 
I only stumbled over the problem because the Ubuntu 3.16 mainline 
testing kernels are compiled with CMA on.

-mario
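[As a concrete illustration of the single-page optimization proposed earlier in the thread, here is a userspace model of the routing decision. The helper names stand in for dma_alloc_from_contiguous() and alloc_pages_node() and are assumptions for illustration, not the actual DMA subsystem code.]

```c
#include <assert.h>
#include <stdlib.h>

/* Counters so the routing decision is observable in this userspace model. */
static int cma_hits;       /* slow path: dma_alloc_from_contiguous() stand-in */
static int fallback_hits;  /* fast path: alloc_pages_node() stand-in          */

static void *cma_alloc(size_t pages)
{
    cma_hits++;
    return malloc(pages * 4096);
}

static void *page_alloc(size_t pages)
{
    fallback_hits++;
    return malloc(pages * 4096);
}

/* The optimization under discussion: a single coherent page gains nothing
 * from CMA's contiguity guarantee, so only multi-page requests should be
 * routed to the (much slower) contiguous allocator. */
static void *dma_alloc_coherent_model(size_t pages)
{
    void *p = NULL;

    if (pages > 1)
        p = cma_alloc(pages);
    if (!p)
        p = page_alloc(pages);
    return p;
}
```

[With such a check in place, the single-page allocations ttm issues would never touch the CMA area, sidestepping both the stalls and the monopolization concern above.]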



* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-13  1:50                 ` Michel Dänzer
  2014-08-13  2:04                   ` Mario Kleiner
@ 2014-08-13  2:04                   ` Jerome Glisse
  1 sibling, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2014-08-13  2:04 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: Thomas Hellstrom, Konrad Rzeszutek Wilk, kamal, LKML, dri-devel,
	Dave Airlie, ben, m.szyprowski

On Wed, Aug 13, 2014 at 10:50:25AM +0900, Michel Dänzer wrote:
> On 12.08.2014 00:17, Jerome Glisse wrote:
> > On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >> On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>> On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >>>> On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>>>>
> >>>>> The other problem is that probably TTM does not reuse pages from the
> >>>>> DMA pool. If i trace the __ttm_dma_alloc_page
> >>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>> and
> >>>>> __ttm_dma_free_page
> >>>>> <http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
> >>>>> calls for
> >>>>> those single page allocs/frees, then over a 20 second interval of
> >>>>> tracing and switching tabs in firefox, scrolling things around etc. i
> >>>>> find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>>>> 1648 frees.
> >>>> This is because historically the pools have been designed to keep only
> >>>> pages with nonstandard caching attributes since changing page caching
> >>>> attributes have been very slow but the kernel page allocators have been
> >>>> reasonably fast.
> >>>>
> >>>> /Thomas
> >>>
> >>> Ok. A bit more ftraceing showed my hang problem case goes through the
> >>> "if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>> it bouncing up and down by 4 pages all the time.
> >>>
> >>> But for the non-cached case, which i don't hit with my problem, could
> >>> one of you look at line 954...
> >>>
> >>> http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>
> >>>
> >>> ... and tell me why that unconditional npages = count; assignment
> >>> makes sense? It seems to essentially disable all recycling for the dma
> >>> pool whenever the pool isn't filled up to/beyond its maximum with free
> >>> pages? When the pool is filled up, lots of stuff is recycled, but when
> >>> it is already somewhat below capacity, it gets "punished" by not
> >>> getting refilled? I'd just like to understand the logic behind that line.
> >>>
> >>> thanks,
> >>> -mario
> >>
> >> I'll happily forward that question to Konrad who wrote the code (or it
> >> may even stem from the ordinary page pool code which IIRC has Dave
> >> Airlie / Jerome Glisse as authors)
> > 
> > This is effectively bogus code, i now wonder how it came to stay alive.
> > Attached patch will fix that.
> 
> I haven't tested Mario's scenario specifically, but it survived piglit
> and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> some BOs ended up in GTT instead with write-combined CPU mappings) on
> radeonsi without any noticeable issues.
> 
> Tested-by: Michel Dänzer <michel.daenzer@amd.com>
> 

My patch does not fix the cma bug: cma should not allocate single pages from
its reserved contiguous memory. But cma is a broken technology in the first
place and it should not be enabled on x86; whoever did that is a moron.

So i would definitely encourage opening a bug against cma.

Nonetheless the ttm code was buggy too, and this patch will fix that, but it
will only alleviate or delay the symptoms reported by Mario.

Cheers,
Jérôme

> 
> -- 
> Earthling Michel Dänzer            |                  http://www.amd.com
> Libre software enthusiast          |                Mesa and X developer


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-13  2:04                   ` Mario Kleiner
@ 2014-08-13  2:17                       ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2014-08-13  2:17 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Michel Dänzer, Thomas Hellstrom, Konrad Rzeszutek Wilk,
	kamal, LKML, dri-devel, Dave Airlie, ben, m.szyprowski

On Wed, Aug 13, 2014 at 04:04:15AM +0200, Mario Kleiner wrote:
> On 08/13/2014 03:50 AM, Michel Dänzer wrote:
> >On 12.08.2014 00:17, Jerome Glisse wrote:
> >>On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>>>On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >>>>>On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>>>>>The other problem is that probably TTM does not reuse pages from the
> >>>>>>DMA pool. If i trace the __ttm_dma_alloc_page
> >>>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>>>and
> >>>>>>__ttm_dma_free_page
> >>>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
> >>>>>>calls for
> >>>>>>those single page allocs/frees, then over a 20 second interval of
> >>>>>>tracing and switching tabs in firefox, scrolling things around etc. i
> >>>>>>find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>>>>>1648 frees.
> >>>>>This is because historically the pools have been designed to keep only
> >>>>>pages with nonstandard caching attributes since changing page caching
> >>>>>attributes have been very slow but the kernel page allocators have been
> >>>>>reasonably fast.
> >>>>>
> >>>>>/Thomas
> >>>>Ok. A bit more ftraceing showed my hang problem case goes through the
> >>>>"if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>>>it bouncing up and down by 4 pages all the time.
> >>>>
> >>>>But for the non-cached case, which i don't hit with my problem, could
> >>>>one of you look at line 954...
> >>>>
> >>>>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>>
> >>>>
> >>>>... and tell me why that unconditional npages = count; assignment
> >>>>makes sense? It seems to essentially disable all recycling for the dma
> >>>>pool whenever the pool isn't filled up to/beyond its maximum with free
> >>>>pages? When the pool is filled up, lots of stuff is recycled, but when
> >>>>it is already somewhat below capacity, it gets "punished" by not
> >>>>getting refilled? I'd just like to understand the logic behind that line.
> >>>>
> >>>>thanks,
> >>>>-mario
> >>>I'll happily forward that question to Konrad who wrote the code (or it
> >>>may even stem from the ordinary page pool code which IIRC has Dave
> >>>Airlie / Jerome Glisse as authors)
> >>This is effectively bogus code, i now wonder how it came to stay alive.
> >>Attached patch will fix that.
> >I haven't tested Mario's scenario specifically, but it survived piglit
> >and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> >some BOs ended up in GTT instead with write-combined CPU mappings) on
> >radeonsi without any noticeable issues.
> >
> >Tested-by: Michel Dänzer <michel.daenzer@amd.com>
> >
> >
> 
> I haven't tested the patch yet. For the original bug it won't help directly,
> because the super-slow allocations which cause the desktop stall are
> tt_cached allocations, so they go through the if (is_cached) code path which
> isn't improved by Jerome's patch. is_cached always releases memory
> immediately, so the tt_cached pool just bounces up and down between 4 and 7
> pages. So this was an independent issue. The slow allocations i noticed were
> mostly caused by exa allocating new gem bo's, i don't know which path is
> taken by 3d graphics?
> 
> However, the fixed ttm path could indirectly solve the DMA_CMA stalls by
> completely killing CMA for its intended purpose. Typical CMA sizes are
> probably around < 100 MB (kernel default is 16 MB, Ubuntu config is 64 MB),
> and the limit for the page pool seems to be more like 50% of all system RAM?
> Iow. if the ttm dma pool is allowed to grow that big with recycled pages, it
> probably will almost completely monopolize the whole CMA memory after a
> short amount of time. ttm won't suffer stalls if it essentially doesn't
> interact with CMA anymore after a warmup period, but actual clients which
> really need CMA (ie., hardware without scatter-gather dma etc.) will be
> starved of what they need as far as my limited understanding of the CMA
> goes.

Yes, currently we allow the pool to be way too big; given that the pool was
probably never really used, we most likely never had much of an issue. So i
would hold off on applying my patch until more proper limits are in place. My
thinking was to go for something like 32/64M at most, and less than that if
< 256M total RAM. I also think that we should lower the pool size on the first
call to shrink, and only increase it again after some timeout since the last
call to shrink, so that when shrink is called we minimize our pool size at
least for a time. Will put together a couple of patches for doing that.
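[A userspace sketch of that limit-plus-timeout policy; all constants and names here are illustrative assumptions, not the eventual patches.]

```c
#include <assert.h>
#include <stdbool.h>

#define MB (1024UL * 1024UL)

/* Illustrative model of the proposed sizing policy: cap the pool at 64MB
 * (32MB when total RAM is below 256MB), shrink it to nothing when the
 * shrinker fires, and only let it grow back after a quiet period. */
struct pool_policy {
    unsigned long total_ram;    /* bytes */
    unsigned long last_shrink;  /* timestamp of the last shrinker call */
    bool shrunk;
};

static unsigned long pool_max_bytes(const struct pool_policy *p,
                                    unsigned long now, unsigned long timeout)
{
    unsigned long cap = (p->total_ram < 256 * MB) ? 32 * MB : 64 * MB;

    /* After a shrink, keep the pool minimal until `timeout` has elapsed
     * since the last shrinker call. */
    if (p->shrunk && now - p->last_shrink < timeout)
        return 0;
    return cap;
}

static void pool_notify_shrink(struct pool_policy *p, unsigned long now)
{
    p->shrunk = true;
    p->last_shrink = now;
}
```

[Under memory pressure the pool stays minimal, and only once the shrinker has been quiet for a while is it allowed to grow back toward its cap.]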

> 
> So fwiw probably the fix to ttm will increase the urgency for the CMA people
> to come up with a fix/optimization for the allocator. Unless it doesn't
> matter if most desktop systems have CMA disabled by default, and ttm is
> mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? I only
> stumbled over the problem because the Ubuntu 3.16 mainline testing kernels
> are compiled with CMA on.
> 

Enabling cma on x86 is proof of brain damage; that said, the dma allocator
should not use the cma area for single page allocations.

> -mario
> 


* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
@ 2014-08-13  2:17                       ` Jerome Glisse
  0 siblings, 0 replies; 30+ messages in thread
From: Jerome Glisse @ 2014-08-13  2:17 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Thomas Hellstrom, Konrad Rzeszutek Wilk, kamal, LKML, dri-devel,
	Dave Airlie, ben, Michel Dänzer, m.szyprowski

On Wed, Aug 13, 2014 at 04:04:15AM +0200, Mario Kleiner wrote:
> On 08/13/2014 03:50 AM, Michel Dänzer wrote:
> >On 12.08.2014 00:17, Jerome Glisse wrote:
> >>On Mon, Aug 11, 2014 at 12:11:21PM +0200, Thomas Hellstrom wrote:
> >>>On 08/10/2014 08:02 PM, Mario Kleiner wrote:
> >>>>On 08/10/2014 01:03 PM, Thomas Hellstrom wrote:
> >>>>>On 08/10/2014 05:11 AM, Mario Kleiner wrote:
> >>>>>>The other problem is that probably TTM does not reuse pages from the
> >>>>>>DMA pool. If i trace the __ttm_dma_alloc_page
> >>>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_alloc_page>
> >>>>>>and
> >>>>>>__ttm_dma_free_page
> >>>>>><http://lxr.free-electrons.com/ident?i=__ttm_dma_free_page>
> >>>>>>calls for
> >>>>>>those single page allocs/frees, then over a 20 second interval of
> >>>>>>tracing and switching tabs in firefox, scrolling things around etc. i
> >>>>>>find about as many alloc's as i find free's, e.g., 1607 allocs vs.
> >>>>>>1648 frees.
> >>>>>This is because historically the pools have been designed to keep only
> >>>>>pages with nonstandard caching attributes since changing page caching
> >>>>>attributes have been very slow but the kernel page allocators have been
> >>>>>reasonably fast.
> >>>>>
> >>>>>/Thomas
> >>>>Ok. A bit more ftraceing showed my hang problem case goes through the
> >>>>"if (is_cached)" paths, so the pool doesn't recycle anything and i see
> >>>>it bouncing up and down by 4 pages all the time.
> >>>>
> >>>>But for the non-cached case, which i don't hit with my problem, could
> >>>>one of you look at line 954...
> >>>>
> >>>>http://lxr.free-electrons.com/source/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c#L954
> >>>>
> >>>>
> >>>>... and tell me why that unconditional npages = count; assignment
> >>>>makes sense? It seems to essentially disable all recycling for the dma
> >>>>pool whenever the pool isn't filled up to/beyond its maximum with free
> >>>>pages? When the pool is filled up, lots of stuff is recycled, but when
> >>>>it is already somewhat below capacity, it gets "punished" by not
> >>>>getting refilled? I'd just like to understand the logic behind that line.
> >>>>
> >>>>thanks,
> >>>>-mario
> >>>I'll happily forward that question to Konrad who wrote the code (or it
> >>>may even stem from the ordinary page pool code which IIRC has Dave
> >>>Airlie / Jerome Glisse as authors)
> >>This is effectively bogus code, i now wonder how it came to stay alive.
> >>Attached patch will fix that.
> >I haven't tested Mario's scenario specifically, but it survived piglit
> >and the UE4 Effects Cave Demo (for which 1GB of VRAM isn't enough, so
> >some BOs ended up in GTT instead with write-combined CPU mappings) on
> >radeonsi without any noticeable issues.
> >
> >Tested-by: Michel Dänzer <michel.daenzer@amd.com>
> >
> >
> 
> I haven't tested the patch yet. For the original bug it won't help directly,
> because the super-slow allocations which cause the desktop stall are
> tt_cached allocations, so they go through the if (is_cached) code path which
> isn't improved by Jerome's patch. is_cached always releases memory
> immediately, so the tt_cached pool just bounces up and down between 4 and 7
> pages. So this was an independent issue. The slow allocations I noticed were
> mostly caused by EXA allocating new GEM BOs; I don't know which path is
> taken by 3D graphics.
> 
> However, the fixed TTM path could indirectly solve the DMA_CMA stalls by
> completely killing CMA for its intended purpose. Typical CMA sizes are
> probably around < 100 MB (the kernel default is 16 MB, the Ubuntu config is
> 64 MB), and the limit for the page pool seems to be more like 50% of all
> system RAM. IOW, if the TTM DMA pool is allowed to grow that big with
> recycled pages, it will probably almost completely monopolize the whole CMA
> memory after a short amount of time. TTM won't suffer stalls if it
> essentially doesn't interact with CMA anymore after a warmup period, but
> actual clients which really need CMA (i.e., hardware without scatter-gather
> DMA etc.) will be starved of what they need, as far as my limited
> understanding of CMA goes.

Yes, currently we allow the pool to be way too big; given that the pool was
probably never really used, we most likely never had much of an issue. So I
would hold off applying my patch until more proper limits are in place. My
thinking was to go for something like 32/64M at most, and less than that if
there is < 256M total RAM. I also think that we should lower the pool size on
the first call to shrink, and only increase it again after some timeout since
the last call to shrink, so that when shrink is called we minimize our pool
size at least for a while. Will put together a couple of patches for doing
that.
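As a rough user-space model, the pool-size policy described above (cap the pool at a small fixed maximum, less on small-RAM machines; clamp it down when the shrinker fires; only let it grow back after a cooldown) could look like this. All constants, names, and the 60-second cooldown are illustrative assumptions, not the actual kernel patch:

```c
#include <assert.h>

#define MB (1024UL * 1024UL)

struct pool_policy {
    unsigned long max_bytes;      /* current cap on pooled pages */
    unsigned long hard_max_bytes; /* cap when there is no shrink pressure */
    unsigned long last_shrink;    /* time of last shrinker call, in seconds */
};

static void pool_init(struct pool_policy *p, unsigned long total_ram)
{
    /* ~64M at most, and noticeably less if the machine has < 256M RAM */
    p->hard_max_bytes = (total_ram < 256 * MB) ? total_ram / 8 : 64 * MB;
    p->max_bytes = p->hard_max_bytes;
    p->last_shrink = 0;
}

static void pool_on_shrink(struct pool_policy *p, unsigned long now)
{
    /* under memory pressure, halve the cap immediately */
    p->max_bytes = p->hard_max_bytes / 2;
    p->last_shrink = now;
}

static unsigned long pool_cap(struct pool_policy *p, unsigned long now)
{
    /* grow back to the full cap only after a 60 s cooldown */
    if (now - p->last_shrink >= 60)
        p->max_bytes = p->hard_max_bytes;
    return p->max_bytes;
}
```

The point of the cooldown is that a single shrinker call keeps the pool small for a while, instead of the pool refilling to its maximum the moment pressure subsides.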

> 
> So FWIW the fix to TTM will probably increase the urgency for the CMA people
> to come up with a fix/optimization for the allocator. Unless it doesn't
> matter, if most desktop systems have CMA disabled by default and TTM is
> mostly used by desktop graphics drivers (nouveau, radeon, vmwgfx)? I only
> stumbled over the problem because the Ubuntu 3.16 mainline testing kernels
> are compiled with CMA on.
> 

Enabling CMA on x86 is proof of brain damage. That said, the DMA allocator
should not use the CMA area for single page allocations.

> -mario
> 

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: CONFIG_DMA_CMA causes ttm performance problems/hangs.
  2014-08-13  2:17                       ` Jerome Glisse
@ 2014-08-13  8:42                         ` Lucas Stach
  -1 siblings, 0 replies; 30+ messages in thread
From: Lucas Stach @ 2014-08-13  8:42 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Mario Kleiner, Thomas Hellstrom, Konrad Rzeszutek Wilk, kamal,
	LKML, dri-devel, Dave Airlie, ben, Michel Dänzer,
	m.szyprowski

On Tuesday, 12.08.2014 at 22:17 -0400, Jerome Glisse wrote:
[...]
> > I haven't tested the patch yet. For the original bug it won't help directly,
> > because the super-slow allocations which cause the desktop stall are
> > tt_cached allocations, so they go through the if (is_cached) code path which
> > isn't improved by Jerome's patch. is_cached always releases memory
> > immediately, so the tt_cached pool just bounces up and down between 4 and 7
> > pages. So this was an independent issue. The slow allocations i noticed were
> > mostly caused by exa allocating new gem bo's, i don't know which path is
> > taken by 3d graphics?
> > 
> > However, the fixed ttm path could indirectly solve the DMA_CMA stalls by
> > completely killing CMA for its intended purpose. Typical CMA sizes are
> > probably around < 100 MB (kernel default is 16 MB, Ubuntu config is 64 MB),
> > and the limit for the page pool seems to be more like 50% of all system RAM?
> > Iow. if the ttm dma pool is allowed to grow that big with recycled pages, it
> > probably will almost completely monopolize the whole CMA memory after a
> > short amount of time. ttm won't suffer stalls if it essentially doesn't
> > interact with CMA anymore after a warmup period, but actual clients which
> > really need CMA (ie., hardware without scatter-gather dma etc.) will be
> > starved of what they need as far as my limited understanding of the CMA
> > goes.
> 
> Yes currently we allow the pool to be way too big, given that pool was probably
> never really use we most likely never had much of an issue. So i would hold on
> applying my patch until more proper limit are in place. My thinking was to go
> for something like 32/64M at most and less then that if < 256M total ram. I also
> think that we should lower the pool size on first call to shrink and only increase
> it again after some timeout since last call to shrink so that when shrink is call
> we minimize our pool size at least for a time. Will put together couple patches
> for doing that.
> 
> > 
> > So fwiw probably the fix to ttm will increase the urgency for the CMA people
> > to come up with a fix/optimization for the allocator. Unless it doesn't
> > matter if most desktop systems have CMA disabled by default, and ttm is
> > mostly used by desktop graphics drivers (nouveau, radeon, vmgfx)? I only
> > stumbled over the problem because the Ubuntu 3.16 mainline testing kernels
> > are compiled with CMA on.
> > 
> 
> Enabling cma on x86 is proof of brain damage that said the dma allocator should
> not use the cma area for single page allocation.
> 
Harsh words.

Yes, allocating pages unconditionally from CMA if it is enabled is an
artifact of CMA's ARM heritage. While it seems completely backwards to
allocate single pages from CMA on x86, on ARM the CMA pool is the only
way to get lowmem pages on which you are allowed to change the caching
state.

So the obvious fix is to avoid CMA for order-0 allocations on x86. I can
cook up a patch for this.
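The decision being proposed can be sketched as a small user-space model: on x86, order-0 (single page) DMA allocations bypass the CMA area and fall back to the regular page allocator, while larger contiguous requests may still come from CMA. The function and enum names are hypothetical, for illustration only:

```c
#include <assert.h>
#include <stdbool.h>

enum alloc_source { SRC_PAGE_ALLOCATOR, SRC_CMA };

/* Model of the proposed policy: skip CMA for single-page (order-0)
 * allocations on x86, where there is no caching-attribute reason to
 * use it; on ARM, keep using CMA even for order 0, since it is the
 * only way to get lowmem pages whose caching state may be changed. */
static enum alloc_source pick_dma_source(unsigned int order, bool is_x86,
                                         bool cma_enabled)
{
    if (!cma_enabled)
        return SRC_PAGE_ALLOCATOR;
    if (is_x86 && order == 0)
        return SRC_PAGE_ALLOCATOR;
    return SRC_CMA;
}
```

Under this policy, the TTM DMA pool's order-0 churn no longer drains the CMA area on x86, while clients that genuinely need large contiguous buffers still get them from CMA.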

Regards,
Lucas 
-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |


^ permalink raw reply	[flat|nested] 30+ messages in thread


end of thread, other threads:[~2014-08-13  8:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-08-08 17:42 CONFIG_DMA_CMA causes ttm performance problems/hangs Mario Kleiner
2014-08-08 17:42 ` Mario Kleiner
2014-08-09  5:39 ` Thomas Hellstrom
2014-08-09  5:39   ` Thomas Hellstrom
2014-08-09 13:33   ` Konrad Rzeszutek Wilk
2014-08-09 13:33     ` Konrad Rzeszutek Wilk
2014-08-09 13:58     ` Thomas Hellstrom
2014-08-09 13:58       ` Thomas Hellstrom
2014-08-10  3:06       ` Mario Kleiner
2014-08-10  3:11       ` Mario Kleiner
2014-08-10  3:11         ` Mario Kleiner
2014-08-10 11:03         ` Thomas Hellstrom
2014-08-10 11:03           ` Thomas Hellstrom
2014-08-10 18:02           ` Mario Kleiner
2014-08-10 18:02             ` Mario Kleiner
2014-08-11 10:11             ` Thomas Hellstrom
2014-08-11 10:11               ` Thomas Hellstrom
2014-08-11 15:17               ` Jerome Glisse
2014-08-11 15:17                 ` Jerome Glisse
2014-08-12 12:12                 ` Mario Kleiner
2014-08-12 12:12                   ` Mario Kleiner
2014-08-12 20:47                   ` Konrad Rzeszutek Wilk
2014-08-12 20:47                     ` Konrad Rzeszutek Wilk
2014-08-13  1:50                 ` Michel Dänzer
2014-08-13  2:04                   ` Mario Kleiner
2014-08-13  2:17                     ` Jerome Glisse
2014-08-13  2:17                       ` Jerome Glisse
2014-08-13  8:42                       ` Lucas Stach
2014-08-13  8:42                         ` Lucas Stach
2014-08-13  2:04                   ` Jerome Glisse

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.