From mboxrd@z Thu Jan 1 00:00:00 1970
From: bugzilla-daemon@bugzilla.kernel.org
Subject: [Bug 15293] Flash video laggy inside Firefox only with KMS
Date: Wed, 17 Feb 2010 15:10:10 GMT
Message-ID: <201002171510.o1HFAAOJ026722@demeter.kernel.org>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Errors-To: dri-devel-bounces@lists.sourceforge.net
To: dri-devel@lists.sourceforge.net
List-Id: dri-devel@lists.freedesktop.org
http://bugzilla.kernel.org/show_bug.cgi?id=15293
Pauli changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |suokkos@gmail.com
--- Comment #11 from Pauli 2010-02-17 15:10:01 ---
I looked at the provided profile data, but it is mostly useless because debug
symbols are missing. It would be nice if the profiling could be repeated with
debug symbols installed for the kernel, xserver and ddx driver.
I did some profiling on my AGP system to see what might be going on.
The problematic places for the vimeo video in the original report are:
1. 26% of cpu time goes to allocating a bo in GTT for exaGetImage (in DFS). The
real problem in the allocation is the cache flush when changing pages from WB
to WC, and purge_vmap_area_lazy in ttm. But as there are reports from PCIE
users as well, I don't believe this is the problem for them.
2. libflashplayer.so takes 11% of cpu time directly and 12% indirectly by
calling gtk/gdk.
3. 14% of cpu time goes to memcpy from GTT to system memory. This is far more
than memcpy from system memory to GTT takes. I guess the WC caching is slowing
down the operation, but I would still need to run some micro benchmarks to
locate the problem.
4. UTS takes 7%.
That totals 70% of cpu utilization. Firefox shows 26% cpu time in total, but
that includes 23% for flash and only 3% for firefox itself.
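As a quick cross-check of the figures above, the quoted shares can simply be
added up. The dictionary keys below are informal labels chosen for this sketch,
not symbol names from the profile; the percentages are the ones quoted in the
list.

```python
# CPU shares quoted in the list above (percent of total CPU time).
# Keys are informal labels for this sketch, not names from the profiler.
profile = {
    "bo allocation in GTT for exaGetImage (DFS)": 26,
    "flash plugin, direct": 11,
    "flash plugin, indirect via gtk/gdk": 12,
    "memcpy from GTT to system memory": 14,
    "UTS": 7,
}

total = sum(profile.values())
flash_total = (profile["flash plugin, direct"]
               + profile["flash plugin, indirect via gtk/gdk"])

print(f"total: {total}%")        # total: 70%
print(f"flash: {flash_total}%")  # flash: 23%
```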
Is it possible to skip the blit to scratch on PCIe systems? Skipping the
scratch would reduce memory bandwidth use quite nicely for large flash videos,
especially as flash is already wasting a lot of memory bandwidth. The data flow
in flash video playback is system->VRAM->system->VRAM, which uses several times
the memory bandwidth of simple video playback.
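To put a rough number on "several times", one can count the arrows in each data
flow and multiply by the frame size and frame rate. This is only a
back-of-the-envelope sketch: the video parameters are hypothetical values
picked for illustration, not measurements from this bug, and the model assumes
every arrow copies one full frame and ignores the extra blit to scratch.

```python
def copies_per_frame(path: str) -> int:
    """Count the transfers in a data-flow string such as 'system->VRAM'."""
    return len(path.split("->")) - 1


def bandwidth_mb_per_s(path: str, width: int, height: int,
                       bytes_per_pixel: int, fps: float) -> float:
    """Megabytes moved per second if every arrow copies one full frame."""
    frame_bytes = width * height * bytes_per_pixel
    return copies_per_frame(path) * frame_bytes * fps / 1e6


# Hypothetical video parameters, chosen only for illustration.
WIDTH, HEIGHT, BPP, FPS = 640, 360, 4, 25

flash = bandwidth_mb_per_s("system->VRAM->system->VRAM", WIDTH, HEIGHT, BPP, FPS)
simple = bandwidth_mb_per_s("system->VRAM", WIDTH, HEIGHT, BPP, FPS)

print(f"flash path:  {flash:.2f} MB/s")
print(f"simple path: {simple:.2f} MB/s")
print(f"ratio:       {flash / simple:.0f}x")
```

Under these assumptions the flash path moves three times as much data as plain
video playback, whatever frame size and rate are plugged in.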
Idea for a DFS handler optimization for AGP systems:
Preallocate 2 scratch buffers in GTT (maybe 256k each?) for all DFS and UTS
operations.
function DFS:
    send 2 blit commands (from vram to scratch) to the GPU with a fence between.
    i = 0;
    while (data to copy) {
        map scratch[i]
        memcpy scratch[i] to system memory
        unmap scratch[i]
        if ( more to read from vram ) {
            send blit from vram to scratch[i]
        }
        i = 1 - i;
    }
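The ping-pong scheme above can be tried out in user space with a small
simulation. This is a sketch under stated assumptions, not driver code: the
"blits" are modelled as plain slice copies, there is no real GPU, fence or
mapping, and the names dfs_download, blit and SCRATCH_SIZE are made up for this
example. It only shows that alternating between two scratch buffers delivers
the data in order, including when the size is not a multiple of the scratch
size.

```python
SCRATCH_SIZE = 256 * 1024  # size floated in the text ("maybe 256k each?")


def dfs_download(vram: bytes, scratch_size: int = SCRATCH_SIZE) -> bytes:
    """Simulate a double-buffered download from VRAM to system memory.

    Blits are modelled as slice copies into one of two scratch buffers.
    A real driver would queue them on the GPU and wait on a fence before
    mapping the buffer.
    """
    scratch = [bytearray(scratch_size), bytearray(scratch_size)]
    filled = [0, 0]      # number of valid bytes in each scratch buffer
    next_offset = 0      # next VRAM offset that has not been blitted yet

    def blit(i: int) -> None:
        """Stand-in for 'send blit from vram to scratch[i]'."""
        nonlocal next_offset
        chunk = vram[next_offset:next_offset + scratch_size]
        scratch[i][:len(chunk)] = chunk
        filled[i] = len(chunk)
        next_offset += len(chunk)

    # Prime both scratch buffers so one blit is always queued ahead.
    blit(0)
    blit(1)

    system = bytearray()
    i = 0
    while filled[i] > 0:                  # "while (data to copy)"
        # map scratch[i]; memcpy to system memory; unmap scratch[i]
        system += scratch[i][:filled[i]]
        filled[i] = 0
        if next_offset < len(vram):       # "more to read from vram"
            blit(i)
        i = 1 - i                         # switch to the other scratch buffer
    return bytes(system)


# 640,000 bytes of test data, deliberately not a multiple of 256k.
frame = bytes(range(256)) * 2500
assert dfs_download(frame) == frame
```

In a real driver the point of the second buffer is that the GPU can fill
scratch[1 - i] while the CPU is still copying out of scratch[i]; the simulation
runs both steps sequentially and so says nothing about the achievable speedup.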
There seem to be multiple performance bugs that flash is triggering to cause
the effects this bug report is about.
The largest bug seems to be the expensive buffer object allocation in GTT. I
don't know if this can be fixed in the TTM code, but at least the ddx could
reduce the number of allocations.
The next largest bug is that memcpy is very expensive when copying from GTT to
system memory. I don't know why, or how to fix it, without some micro
benchmarks.