From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:53099) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XEy9y-0005h2-PZ for qemu-devel@nongnu.org; Wed, 06 Aug 2014 06:09:35 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1XEy9t-0004np-Jk for qemu-devel@nongnu.org; Wed, 06 Aug 2014 06:09:30 -0400 Received: from mx1.redhat.com ([209.132.183.28]:42336) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1XEy9t-0004nl-Cm for qemu-devel@nongnu.org; Wed, 06 Aug 2014 06:09:25 -0400 Date: Wed, 6 Aug 2014 12:09:18 +0200 From: Kevin Wolf Message-ID: <20140806100918.GC4090@noname.str.redhat.com> References: <1407209598-2572-1-git-send-email-ming.lei@canonical.com> <20140805094844.GF4391@noname.str.redhat.com> <20140805134815.GD12251@stefanha-thinkpad.redhat.com> <20140805144728.GH4391@noname.str.redhat.com> <20140806084855.GA4090@noname.str.redhat.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Subject: Re: [Qemu-devel] [PATCH v1 00/17] dataplane: optimization and multi virtqueue support List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Ming Lei Cc: Peter Maydell , Fam Zheng , "Michael S. Tsirkin" , qemu-devel , Stefan Hajnoczi , Paolo Bonzini Am 06.08.2014 um 11:37 hat Ming Lei geschrieben: > On Wed, Aug 6, 2014 at 4:48 PM, Kevin Wolf wrote: > > Am 06.08.2014 um 07:33 hat Ming Lei geschrieben: > >> Hi Kevin, > >> > >> On Tue, Aug 5, 2014 at 10:47 PM, Kevin Wolf wrote: > >> > Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben: > >> >> I have been wondering how to prove that the root cause is the ucontext > >> >> coroutine mechanism (stack switching). Here is an idea: > >> >> > >> >> Hack your "bypass" code path to run the request inside a coroutine. > >> >> That way you can compare "bypass without coroutine" against "bypass with > >> >> coroutine". 
> >> >>
> >> >> Right now I think there are doubts because the bypass code path is
> >> >> indeed a different (and not 100% correct) code path. So this approach
> >> >> might prove that the coroutines are adding the overhead and not
> >> >> something that you bypassed.
> >> >
> >> > My doubts aren't only that the overhead might not come from the
> >> > coroutines, but also whether any coroutine-related overhead is really
> >> > unavoidable. If we can optimise coroutines, I'd strongly prefer to do
> >> > just that instead of introducing additional code paths.
> >>
> >> OK, thank you for taking a look at the problem; I hope we can
> >> figure out the root cause. :-)
> >>
> >> >
> >> > Another thought I had was this: If the performance difference is indeed
> >> > only coroutines, then that is completely inside the block layer and we
> >> > don't actually need a VM to test it. We could instead have something
> >> > like a simple qemu-img based benchmark and should be observing the same.
> >>
> >> Even though it is simpler to run a coroutine-only benchmark, I just
> >> wrote a raw one, and it looks like coroutines do decrease performance
> >> a lot. Please see the attached patch, and thanks for your template,
> >> which helped me add the 'co_bench' command to qemu-img.
> >
> > Yes, we can look at coroutine microbenchmarks in isolation. I actually
> > did do that yesterday with the yield test from tests/test-coroutine.c.
> > And in fact profiling immediately showed something to optimise:
> > pthread_getspecific() was quite high; replacing it by __thread on
> > systems where it works is more efficient and helped the numbers a bit.
> > Also, a lot of time seems to be spent in pthread_mutex_lock/unlock (even
> > in qemu-img bench), so maybe there's even something that can be done here.
>
> The lock/unlock in dataplane is often from memory_region_find(), and Paolo
> has already done lots of work on that.
> >
> > However, I just wasn't sure whether a change on this level would be
> > relevant in a realistic environment. This is the reason why I wanted to
> > get a benchmark involving the block layer and some I/O.
> >
> >> From the profiling data in the link below:
> >>
> >> http://pastebin.com/YwH2uwbq
> >>
> >> With coroutines, the running time for the same workload is increased by
> >> ~50% (1.325s vs. 0.903s), dcache load events are increased by ~35%
> >> (693M vs. 512M), and insns per cycle is decreased by ~17%
> >> (1.35 vs. 1.63), compared with bypassing coroutines (-b parameter).
> >>
> >> The bypass code in the benchmark is very similar to the approach
> >> used in the bypass patch, since linux-aio with O_DIRECT seldom
> >> blocks in the kernel I/O path.
> >>
> >> Maybe the benchmark is a bit extreme, but given that modern storage
> >> devices may reach millions of IOPS, it is very easy for coroutines
> >> to slow down the I/O.
> >
> > I think in order to optimise coroutines, such benchmarks are fair game.
> > It's just not guaranteed that the effects are exactly the same on real
> > workloads, so we should take the results with a grain of salt.
> >
> > Anyhow, the coroutine version of your benchmark is buggy: it leaks all
> > coroutines instead of exiting them, so it can't make any use of the
> > coroutine pool. On my laptop, I get this (where "fixed coro" is a
> > version that simply removes the yield at the end):
> >
> >                 |    bypass     |  fixed coro   |  buggy coro
> > ----------------+---------------+---------------+--------------
> > time            |     1.09s     |     1.10s     |     1.62s
> > L1-dcache-loads |  921,836,360  |  932,781,747  | 1,298,067,438
> > insns per cycle |     2.39      |     2.39      |     1.90
> >
> > Begs the question whether you see a similar effect on a real qemu and
> > the coroutine pool is still not big enough? With correct use of
> > coroutines, the difference seems to be barely measurable even without
> > any I/O involved.
> > When I comment out qemu_coroutine_yield(), the results for bypass and
> fixed coro look very similar to yours, and I am just wondering whether
> the stack is always switched in qemu_coroutine_enter(), even without
> calling qemu_coroutine_yield().

Yes, definitely. qemu_coroutine_enter() always involves calling
qemu_coroutine_switch(), which is the stack switch.

> Without the yield, the benchmark can't emulate the coroutine usage in
> the bdrv_aio_readv/writev() path any more, and the bypass in the
> patchset skips two qemu_coroutine_enter() calls and one
> qemu_coroutine_yield() for each bdrv_aio_readv/writev().

It's not completely comparable anyway because your benchmark isn't going
through a main loop and callbacks from there.

But fair enough: Keep the yield, but then enter the coroutine twice. You
get slightly worse results that way, but it's more like doubling the
very small difference between "bypass" and "fixed coro" (1.11s /
946,434,327 / 2.37), not like the horrible performance of the buggy
version.

Actually, that's within the error of measurement for time and
insns/cycle, so running it for a bit longer:

                |  bypass   |   coro    |  + yield  |  buggy coro
----------------+-----------+-----------+-----------+--------------
time            |   21.45s  |   21.68s  |   21.83s  |    97.05s
L1-dcache-loads |  18,049 M |  18,387 M |  18,618 M |   26,062 M
insns per cycle |    2.42   |    2.40   |    2.41   |     1.75

> >> > I played a bit with the following, I hope it's not too naive. I couldn't
> >> > see a difference with your patches, but at least one reason for this is
> >> > probably that my laptop SSD isn't fast enough to make the CPU the
> >> > bottleneck. Haven't tried a ramdisk yet, that would probably be the next
> >> > thing. (I actually wrote the patch up just for some profiling on my own,
> >> > not for comparing throughput, but it should be usable for that as well.)
> >>
> >> This might not be good for the test since it is basically a sequential
> >> read test, which can be optimized a lot by the kernel.
And I always use
> >> a randread benchmark.
> >
> > Yes, I briefly pondered whether I should implement random offsets
> > instead. But then I realised that a quicker kernel operation would only
> > help the benchmark, because we want it to test the CPU consumption in
> > userspace. So the faster the kernel gets, the better for us, because it
> > should make the impact of coroutines bigger.
>
> OK, I will compare coroutine vs. bypass-co with the benchmark.

Ok, thanks.

Kevin