Am 05.08.2014 um 15:48 hat Stefan Hajnoczi geschrieben:
> On Tue, Aug 05, 2014 at 06:00:22PM +0800, Ming Lei wrote:
> > On Tue, Aug 5, 2014 at 5:48 PM, Kevin Wolf wrote:
> > > Am 05.08.2014 um 05:33 hat Ming Lei geschrieben:
> > >> Hi,
> > >>
> > >> These patches bring up below 4 changes:
> > >> - introduce object allocation pool and apply it to
> > >> virtio-blk dataplane for improving its performance
> > >>
> > >> - introduce selective coroutine bypass mechanism
> > >> for improving performance of virtio-blk dataplane with
> > >> raw format image
> > >
> > > Before applying any bypassing patches, I think we should understand in
> > > detail where we are losing performance with coroutines enabled.
> >
> > From the below profiling data, CPU becomes slow to run instructions
> > with coroutine, and CPU dcache miss is increased so it is very
> > likely caused by switching stack frequently.
> >
> > http://marc.info/?l=qemu-devel&m=140679721126306&w=2
> >
> > http://pastebin.com/ae0vnQ6V
>
> I have been wondering how to prove that the root cause is the ucontext
> coroutine mechanism (stack switching). Here is an idea:
>
> Hack your "bypass" code path to run the request inside a coroutine.
> That way you can compare "bypass without coroutine" against "bypass with
> coroutine".
>
> Right now I think there are doubts because the bypass code path is
> indeed a different (and not 100% correct) code path. So this approach
> might prove that the coroutines are adding the overhead and not
> something that you bypassed.

My doubts aren't only about whether the overhead really comes from the
coroutines, but also about whether any coroutine-related overhead is
actually unavoidable. If we can optimise the coroutines, I'd strongly
prefer to do just that instead of introducing additional code paths.

Another thought I had was this: if the performance difference really
comes down to the coroutines alone, then the problem is contained
entirely in the block layer and we don't actually need a VM to test it.
We could instead use something like a simple qemu-img based benchmark
and should be able to observe the same difference there.
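By the way, if we want a ballpark number for what a raw ucontext stack
switch costs outside of qemu, a trivial standalone ping-pong might
already be telling. The following is only a rough, untested sketch (and
it obviously doesn't model the dcache behaviour of switching between
real, deep coroutine stacks), but it should give an order of magnitude:

/*
 * Rough standalone sketch, not qemu code: ping-pong between two
 * ucontext contexts and report the average cost of one stack switch.
 * Iteration count and stack size are arbitrary.
 *
 * Build with something like: gcc -O2 -o switch-bench switch-bench.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <ucontext.h>

#define ITERATIONS  10000000
#define STACK_SIZE  (1 << 20)

static ucontext_t main_ctx, co_ctx;

/* The "coroutine": immediately yields back to the caller, like a
 * coroutine that is entered and yields without doing any real work. */
static void co_fn(void)
{
    for (;;) {
        swapcontext(&co_ctx, &main_ctx);
    }
}

int main(void)
{
    struct timespec start, end;
    double ns;
    int i;

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp = malloc(STACK_SIZE);
    co_ctx.uc_stack.ss_size = STACK_SIZE;
    co_ctx.uc_link = &main_ctx;
    makecontext(&co_ctx, co_fn, 0);

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < ITERATIONS; i++) {
        /* Each iteration is two stack switches: enter + yield */
        swapcontext(&main_ctx, &co_ctx);
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1e9 +
         (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per stack switch\n", ns / (2.0 * ITERATIONS));

    free(co_ctx.uc_stack.ss_sp);
    return 0;
}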
I played a bit with the qemu-img patch below; I hope it's not too
naive. I couldn't see a difference with your patches applied, but at
least one reason for this is probably that my laptop SSD isn't fast
enough to make the CPU the bottleneck. I haven't tried a ramdisk yet,
that would probably be the next thing. (I actually wrote the patch up
just for some profiling on my own, not for comparing throughput, but it
should be usable for that as well.)

Kevin

diff --git a/qemu-img-cmds.hx b/qemu-img-cmds.hx
index d029609..ae64b3d 100644
--- a/qemu-img-cmds.hx
+++ b/qemu-img-cmds.hx
@@ -9,6 +9,12 @@ STEXI
 @table @option
 ETEXI
 
+DEF("bench", img_bench,
+    "bench [-q] [-f fmt] [-n] [-t cache] filename")
+STEXI
+@item bench [-q] [-f @var{fmt}] [-n] [-t @var{cache}] filename
+ETEXI
+
 DEF("check", img_check,
     "check [-q] [-f fmt] [--output=ofmt] [-r [leaks | all]] filename")
 STEXI
diff --git a/qemu-img.c b/qemu-img.c
index d4518e7..92e9529 100644
--- a/qemu-img.c
+++ b/qemu-img.c
@@ -2789,6 +2789,132 @@ out:
     return 0;
 }
 
+typedef struct BenchData {
+    BlockDriverState *bs;
+    int bufsize;
+    int nrreq;
+    int n;
+    uint8_t *buf;
+    QEMUIOVector *qiov;
+
+    int in_flight;
+    uint64_t sector;
+} BenchData;
+
+static void bench_cb(void *opaque, int ret)
+{
+    BenchData *b = opaque;
+    BlockDriverAIOCB *acb;
+
+    if (ret < 0) {
+        error_report("Failed request: %s", strerror(-ret));
+        exit(EXIT_FAILURE);
+    }
+    if (b->in_flight > 0) {
+        b->n--;
+        b->in_flight--;
+    }
+
+    while (b->n > b->in_flight && b->in_flight < b->nrreq) {
+        acb = bdrv_aio_readv(b->bs, b->sector, b->qiov,
+                             b->bufsize >> BDRV_SECTOR_BITS,
+                             bench_cb, b);
+        if (!acb) {
+            error_report("Failed to issue request");
+            exit(EXIT_FAILURE);
+        }
+        b->in_flight++;
+        b->sector += b->bufsize;
+        b->sector %= b->bs->total_sectors;
+    }
+}
+
+static int img_bench(int argc, char **argv)
+{
+    int c, ret = 0;
+    const char *fmt = NULL, *filename;
+    bool quiet = false;
+    BlockDriverState *bs = NULL;
+    int flags = BDRV_O_FLAGS;
+    int i;
+
+    BenchData data = {
+        .bufsize = 0x1000,
+        .nrreq   = 64,
+        .n       = 75000,
+    };
+
+    for (;;) {
+        c = getopt(argc, argv, "hf:nqt:");
+        if (c == -1) {
+            break;
+        }
+
+        switch (c) {
+        case 'h':
+        case '?':
+            help();
+            break;
+        case 'f':
+            fmt = optarg;
+            break;
+        case 'n':
+            flags |= BDRV_O_NATIVE_AIO;
+            break;
+        case 'q':
+            quiet = true;
+            break;
+        case 't':
+            ret = bdrv_parse_cache_flags(optarg, &flags);
+            if (ret < 0) {
+                error_report("Invalid cache mode");
+                ret = -1;
+                goto out;
+            }
+            break;
+        }
+    }
+
+    if (optind != argc - 1) {
+        error_exit("Expecting one image file name");
+    }
+    filename = argv[argc - 1];
+
+    bs = bdrv_new_open("image", filename, fmt, flags, true, quiet);
+    if (!bs) {
+        error_report("Could not open image '%s'", filename);
+        ret = -1;
+        goto out;
+    }
+    data.bs = bs;
+
+    data.buf = qemu_blockalign(bs, data.nrreq * data.bufsize);
+    data.qiov = g_new(QEMUIOVector, data.nrreq);
+    for (i = 0; i < data.nrreq; i++) {
+        qemu_iovec_init(&data.qiov[i], 1);
+        qemu_iovec_add(&data.qiov[i],
+                       data.buf + i * data.bufsize, data.bufsize);
+    }
+
+    bench_cb(&data, 0);
+
+    while (data.n > 0) {
+        main_loop_wait(false);
+    }
+
+out:
+    qemu_vfree(data.buf);
+    if (bs) {
+        bdrv_unref(bs);
+    }
+
+    if (ret) {
+        return 1;
+    }
+    return 0;
+}
+
+
 static const img_cmd_t img_cmds[] = {
 #define DEF(option, callback, arg_string) \
     { option, callback },
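For reference, with this applied the new subcommand would be invoked
along these lines (-n requests native AIO, -t sets the cache mode, and
the file name is just an example):

    ./qemu-img bench -t none -n test.raw

Buffer size, queue depth and total request count are hard-coded in
img_bench() for now (4k requests, 64 in flight, 75000 requests in
total), so for serious comparisons they should probably become command
line options as well.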