From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@bugzilla.kernel.org
Subject: [Bug 15293] Flash video laggy inside Firefox only with KMS
Date: Wed, 17 Feb 2010 15:10:10 GMT
Message-ID: <201002171510.o1HFAAOJ026722@demeter.kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: dri-devel-bounces@lists.sourceforge.net
To: dri-devel@lists.sourceforge.net
List-Id: dri-devel@lists.freedesktop.org

http://bugzilla.kernel.org/show_bug.cgi?id=15293

Pauli changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |suokkos@gmail.com

--- Comment #11 from Pauli 2010-02-17 15:10:01 ---

I did look at the provided profile data, but it is mostly useless because
debug symbols are missing. It would be nice if the profiling could be repeated
with debug symbols installed for the kernel, xserver and ddx driver.

I did some profiling on my AGP system to see what might be going on. For the
vimeo video in the original report, the problematic places are:

1. 26% of cpu time goes to allocating a bo in GTT for exaGetImage (in DFS).
   The real problem in the allocation is the cache flush when changing pages
   from WB to WC, and purge_vmap_area_lazy in ttm. But since PCIE users also
   report this bug, I don't believe this is the problem for them.
2. libflashplayer.so takes 11% of cpu time directly and 12% indirectly by
   calling gtk/gdk.
3. 14% of cpu time goes to memcpy from GTT to system memory. This is far more
   than the memcpy from system memory to GTT takes. I guess the WC caching is
   slowing down the operation, but I would still need to run some micro
   benchmarks to locate the problem.
4. UTS takes 7%.

That totals 70% cpu utilization. Firefox shows 26% cpu time in total, but
that includes 23% for flash and only 3% for firefox itself.

Is it possible to skip the blit to scratch on PCIe systems?
Skipping the scratch would reduce memory bandwidth use quite nicely for large
flash videos, especially when flash is already wasting a lot of memory
bandwidth. The data flow in flash video playback is
system->VRAM->system->VRAM, which uses several times the memory bandwidth of
simple video playback.

Idea for DFS handler optimization for AGP systems:

preallocate 2 scratch buffers in GTT (maybe 256k each?) for all DFS and UTS
operations

function DFS:
    send 2 blit commands (from vram to scratch) to GPU with a fence between them.
    i = 0;
    while (data to copy) {
        map scratch[i]
        memcpy scratch[i] to system memory
        unmap scratch[i]
        if (more to read from vram) {
            send blit from vram to scratch[i]
        }
        i = 1 - i;
    }

There seem to be multiple performance bugs that flash is triggering, and
together they cause the effects this bug report is about. The largest bug
seems to be the expensive buffer object allocation in GTT. I don't know if
this can be fixed in the TTM code, but at least the ddx could reduce the
number of allocations. The next largest bug is that memcpy is very expensive
when copying from GTT to system memory. I don't know why, or how to fix it,
without some micro benchmarks.
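
For illustration, the double-buffered DFS idea above can be sketched in C as a
host-side simulation. This is only a sketch under stated assumptions, not ddx
or TTM code: blit_vram_to_scratch and dfs_double_buffered are hypothetical
names, the "blit" is a plain synchronous memcpy, and the fence waits and
map/unmap steps are only marked by comments. SCRATCH_SIZE follows the
"maybe 256k each?" guess.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SCRATCH_SIZE (256 * 1024) /* "maybe 256k each?" from the proposal */

/* Hypothetical stand-in for the two preallocated GTT scratch buffers. */
static unsigned char scratch[2][SCRATCH_SIZE];

/*
 * Hypothetical stand-in for queueing a GPU blit from VRAM into scratch[slot].
 * Here it is a synchronous memcpy; a real driver would emit a blit command
 * followed by a fence. Returns the number of bytes placed in the buffer.
 */
static size_t blit_vram_to_scratch(const unsigned char *vram, size_t vram_size,
                                   size_t offset, int slot)
{
    size_t len = vram_size - offset;

    assert(offset <= vram_size);
    if (len > SCRATCH_SIZE)
        len = SCRATCH_SIZE;
    memcpy(scratch[slot], vram + offset, len);
    return len;
}

/* Copy size bytes from simulated VRAM to dst, alternating scratch buffers. */
static void dfs_double_buffered(const unsigned char *vram, size_t size,
                                unsigned char *dst)
{
    size_t len[2] = { 0, 0 };
    size_t queued = 0; /* bytes for which a blit has been issued */
    size_t copied = 0; /* bytes already stored in dst */
    int i;

    /* Prime both scratch buffers so the GPU always has work queued. */
    for (i = 0; i < 2 && queued < size; i++) {
        len[i] = blit_vram_to_scratch(vram, size, queued, i);
        queued += len[i];
    }

    i = 0;
    while (copied < size) {
        /* Real code would wait for scratch[i]'s fence and map it here. */
        memcpy(dst + copied, scratch[i], len[i]);
        copied += len[i];
        /* Real code would unmap scratch[i] here. */

        if (queued < size) {
            /* Refill the buffer just drained while the other is in flight. */
            len[i] = blit_vram_to_scratch(vram, size, queued, i);
            queued += len[i];
        }
        i = 1 - i; /* switch to the other scratch buffer */
    }
}
```

In a real driver the point of the second buffer is that the GPU can fill one
scratch buffer while the CPU drains the other; the simulation only shows the
ordering of the operations, not the overlap.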