From: Stefan Hajnoczi <stefanha@gmail.com>
To: Liu Yuan <namei.unix@gmail.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>,
	Rusty Russell <rusty@rustcorp.com.au>,
	Avi Kivity <avi@redhat.com>,
	kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Khoa Huynh <khoa@us.ibm.com>,
	Badari Pulavarty <pbadari@us.ibm.com>,
	Christoph Hellwig <hch@infradead.org>
Subject: Re: [RFC PATCH]vhost-blk: In-kernel accelerator for virtio block device
Date: Fri, 29 Jul 2011 13:29:44 +0100
Message-ID: <CAJSP0QUV1C3O6SuXzFKMi5t61T+Qa-Yy2yWEevuuAvyQarJ47g@mail.gmail.com>
In-Reply-To: <4E32A105.6080509@gmail.com>

On Fri, Jul 29, 2011 at 1:01 PM, Liu Yuan <namei.unix@gmail.com> wrote:
> On 07/29/2011 05:06 PM, Stefan Hajnoczi wrote:
>>
>> I mean did you investigate *why* userspace virtio-blk has higher
>> latency?  Did you profile it and drill down on its performance?
>>
>> It's important to understand what is going on before replacing it with
>> another mechanism.  What I'm saying is, if I have a buggy program I
>> can sometimes rewrite it from scratch correctly but that doesn't tell
>> me what the bug was.
>>
>> Perhaps the inefficiencies in userspace virtio-blk can be solved by
>> adjusting the code (removing inefficient notification mechanisms,
>> introducing a dedicated thread outside of the QEMU iothread model,
>> etc).  Then we'd get the performance benefit for non-raw images and
>> perhaps non-virtio and non-Linux host platforms too.
>>
>
> As Christoph mentioned, the unnecessary memory allocation and the many
> cache-line-unfriendly function pointers might be the culprit. For example,
> the read request code path for Linux AIO is:
>
>  qemu_iohandler_poll->virtio_pci_host_notifier_read->virtio_queue_notify_vq->virtio_blk_handle_output
> ->virtio_blk_handle_read->bdrv_aio_read->raw_aio_readv->bdrv_aio_readv (yes,
> nested again!)->raw_aio_readv->laio_submit->io_submit...
>
> Looking at this long list, most of these are function pointers that cannot
> be inlined, and the internal data structures they use number in the dozens.
> Code complexity aside, this long code path really needs a retrofit. As
> Christoph put it simply, this kind of mess is inherent all over the QEMU
> code, so I am afraid the 'retrofit' would end up being a rewrite of the
> entire (sub)system. I have to admit that I am inclined toward MST's vhost
> approach: write a new subsystem rather than do tedious profiling and fixing
> that could well go as far as an actual rewrite anyway.
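
To make the indirection concrete: most of those hops go through a per-driver
table of function pointers, roughly like this much-simplified sketch (assumed
types, not actual QEMU code):

/* Much-simplified sketch (assumed types, not actual QEMU code): every
 * layer dispatches through a function-pointer table, so the compiler
 * cannot inline across layers and each hop may touch cold cache lines. */
typedef struct BlockDriverState BlockDriverState;
typedef void BlockCompletionFunc(void *opaque, int ret);

typedef struct BlockDriver {
    int (*bdrv_aio_readv)(BlockDriverState *bs, long sector, void *qiov,
                          int nb_sectors, BlockCompletionFunc *cb,
                          void *opaque);
} BlockDriver;

struct BlockDriverState {
    BlockDriver *drv;        /* format driver, e.g. "raw" */
    BlockDriverState *file;  /* protocol driver underneath, e.g. Linux AIO */
};

/* The "raw" format layer just forwards to the protocol layer, which is
 * why bdrv_aio_readv() shows up twice (nested) in the trace above. */
static int raw_aio_readv_sketch(BlockDriverState *bs, long sector, void *qiov,
                                int nb_sectors, BlockCompletionFunc *cb,
                                void *opaque)
{
    return bs->file->drv->bdrv_aio_readv(bs->file, sector, qiov,
                                         nb_sectors, cb, opaque);
}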

I'm totally for vhost-blk if there are unique benefits that make it
worth maintaining.  But better benchmark results are not a cause; they
are an effect.  So the thing to do is to drill down on both vhost-blk
and userspace virtio-blk to understand what causes the overheads.
Evidence showing that userspace can never compete is needed to justify
vhost-blk, IMO.

>>> Actually, the motivation to start vhost-blk is that, in our observation,
>>> KVM (virtio enabled) in RHEL 6 is worse than Xen (PV) in RHEL from a disk
>>> I/O perspective, especially for sequential reads/writes (around a 20% gap).
>>>
>>> We'll deploy a large number of KVM-based systems as the infrastructure of
>>> some service, and this gap is really unpleasant.
>>>
>>> By design, IMHO, virtio performance is supposed to be comparable to the
>>> para-virtualization solution if not better, because for KVM the guest and
>>> the backend driver can sit in the same address space via mmapping. This
>>> should reduce the overhead of page table modification, and thus speed up
>>> buffer management and data transfer a lot compared with Xen PV.
>>
>> Yes, guest memory is just a region of QEMU userspace memory.  So it's
>> easy to reach inside and there are no page table tricks or copying
>> involved.
>>
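
Concretely, with a single contiguous RAM region the translation from a
guest-physical address in a descriptor to a host pointer is just
base-plus-offset arithmetic; a minimal sketch (simplified, assuming one
mmap'd RAM region):

#include <stdint.h>
#include <stddef.h>

/* Minimal sketch, assuming guest RAM is one contiguous mmap'd region of
 * the QEMU process: a guest-physical address from a virtqueue descriptor
 * becomes a host pointer with no page-table games and no copying. */
static void *gpa_to_hva(uint8_t *ram_base, uint64_t ram_size, uint64_t gpa)
{
    if (gpa >= ram_size)
        return NULL;             /* not backed by this RAM region */
    return ram_base + gpa;
}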
>>> I am not in a qualified position to talk about QEMU, but I think the
>>> surprising performance improvement from this very primitive vhost-blk
>>> simply shows that the internal structure of QEMU I/O is badly bloated. I
>>> say *surprising* because vhost basically just reduces the number of
>>> system calls, a mechanism chip manufacturers have heavily tuned for
>>> years. So I guess the performance gain of vhost-blk can mainly be
>>> attributed to its *shorter and simpler* code path.
>>
>> First we need to understand exactly what the latency overhead is.  If
>> we discover that it's simply not possible to do this equally well in
>> userspace, then it makes perfect sense to use vhost-blk.
>>
>> So let's gather evidence and learn what the overheads really are.
>> Last year I spent time looking at virtio-blk latency:
>> http://www.linux-kvm.org/page/Virtio/Block/Latency
>>
>
> Nice stuff.
>
>> See especially this diagram:
>> http://www.linux-kvm.org/page/Image:Threads.png
>>
>> The goal wasn't specifically to reduce synchronous sequential I/O
>> latency; instead the aim was to reduce overheads for a variety of
>> scenarios, especially multithreaded workloads.
>>
>> In most cases it was helpful to move I/O submission out of the vcpu
>> thread by using the ioeventfd model just like vhost.  Ioeventfd for
>> userspace virtio-blk is now on by default in qemu-kvm.
>>
>> Try running the userspace virtio-blk benchmark with -drive
>> if=none,id=drive0,file=... -device
>> virtio-blk-pci,drive=drive0,ioeventfd=off.  This causes QEMU to do I/O
>> submission in the vcpu thread, which might reduce latency at the cost
>> of stealing guest time.
>>
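
To make the ioeventfd knob concrete: with ioeventfd=on, QEMU registers an
eventfd with KVM for the virtqueue notify register, so a guest kick signals a
file descriptor instead of being handled synchronously in the vcpu thread. A
rough sketch of that registration (assumed parameters, error paths trimmed):

#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Rough sketch (assumed parameters): ask KVM to complete the guest's
 * write of this queue's index to the virtio-pci notify register in the
 * kernel and signal an eventfd, instead of returning to userspace so the
 * vcpu thread can emulate the write. */
static int wire_up_queue_kick(int vm_fd, uint64_t notify_pio_addr,
                              uint16_t queue_idx)
{
    int efd = eventfd(0, 0);
    if (efd < 0)
        return -1;

    struct kvm_ioeventfd kick = {
        .addr      = notify_pio_addr, /* VIRTIO_PCI_QUEUE_NOTIFY in the I/O BAR */
        .len       = 2,               /* the guest writes a 16-bit queue index */
        .datamatch = queue_idx,       /* only kicks for this queue */
        .fd        = efd,
        .flags     = KVM_IOEVENTFD_FLAG_PIO | KVM_IOEVENTFD_FLAG_DATAMATCH,
    };
    if (ioctl(vm_fd, KVM_IOEVENTFD, &kick) < 0)
        return -1;

    return efd;  /* a dedicated thread (or vhost) can now poll/read this fd */
}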
>>> Anyway, IMHO, compared with the userspace approach, the in-kernel one
>>> would allow more flexibility and better integration with the kernel I/O
>>> stack, since we don't need two I/O stacks for the guest OS.
>>
>> I agree that there may be advantages to integrating with in-kernel I/O
>> mechanisms.  An interesting step would be to implement the
>> submit_bio() approach that Christoph suggested and see if that
>> improves things further.
>>
>> Push virtio-blk as far as you can and let's see what the performance is!
>>
>>>> I have a hacked up world here that basically implements vhost-blk in
>>>> userspace:
>>>>
>>>>
>>>> http://repo.or.cz/w/qemu/stefanha.git/blob/refs/heads/virtio-blk-data-plane:/hw/virtio-blk.c
>>>>
>>>>  * A dedicated virtqueue thread sleeps on ioeventfd
>>>>  * Guest memory is pre-mapped and accessed directly (not using QEMU's
>>>> usual memory access functions)
>>>>  * Linux AIO is used, the QEMU block layer is bypassed
>>>>  * Completion interrupts are injected from the virtqueue thread using
>>>> ioctl
>>>>
>>>> I will try to rebase onto qemu-kvm.git/master (this work is several
>>>> months old).  Then we can compare to see how much of the benefit can
>>>> be gotten in userspace.
>>>>
>>> I don't really get what you mean by vhost-blk in user space, since the
>>> vhost infrastructure itself means an in-kernel accelerator. I guess what
>>> you meant is something like a rewrite of virtio-blk in user space with a
>>> dedicated thread handling requests and a shorter code path similar to
>>> vhost-blk.
>>
>> Right - it's the same model as vhost: a dedicated thread listens for
>> ioeventfd virtqueue kicks and processes them out-of-line from the guest
>> and from userspace QEMU's traditional vcpu and iothread.
>>
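
In outline, that dedicated thread is just a loop over the ioeventfd and a
Linux AIO context. A skeleton of the idea (assumed names; request parsing is
elided, and the real code is in the hw/virtio-blk.c linked above):

#include <libaio.h>
#include <stdint.h>
#include <unistd.h>

/* Skeleton of the dedicated virtqueue thread (assumed names, details
 * elided): sleep on the ioeventfd, turn virtqueue requests into Linux
 * AIO, reap completions, and notify the guest from this same thread. */
struct vq_thread {
    int ioeventfd;         /* signalled by KVM on a guest virtqueue kick */
    int irqfd;             /* one way to raise the completion interrupt */
    io_context_t aio_ctx;  /* set up earlier with io_setup() */
};

static void *vq_thread_fn(void *opaque)
{
    struct vq_thread *t = opaque;

    for (;;) {
        uint64_t kicks;
        struct io_event done[64];

        /* Block until the guest kicks the queue. */
        if (read(t->ioeventfd, &kicks, sizeof(kicks)) != sizeof(kicks))
            continue;

        /* Pop descriptors from the virtqueue; guest buffers are plain
         * pointers because guest RAM is pre-mapped. Build iocbs and
         * io_submit() them here (omitted in this sketch). */

        /* Reap whatever has completed and notify the guest, e.g. by
         * signalling an irqfd (the prototype described above injects the
         * interrupt with an ioctl instead). */
        int n = io_getevents(t->aio_ctx, 0, 64, done, NULL);
        if (n > 0) {
            uint64_t one = 1;
            (void)write(t->irqfd, &one, sizeof(one));
        }
    }
    return NULL;
}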
>> When you say "IOPS drops drastically" do you mean that it gets worse
>> than with queue-depth=1?
>>
>
> Yes, on my laptop, when iodepth = 3, IOPS on my host drops to about 3,500
> from 13K! The same happens at iodepth = 4 in my guest during an FIO
> sequential read test. This should never happen.

Yes, that doesn't make sense to me unless the I/O scheduler is doing
something weird.  Have you tried switching between cfq, deadline, and
noop?
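
If it helps, the scheduler can be switched at runtime by writing its name to
sysfs; a small sketch (the device name "vda" is an assumption):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Small sketch (device name "vda" is an assumption): select the active
 * I/O scheduler by writing "cfq", "deadline", or "noop" to sysfs. */
static int set_io_scheduler(const char *sched)
{
    int fd = open("/sys/block/vda/queue/scheduler", O_WRONLY);
    if (fd < 0)
        return -1;
    ssize_t r = write(fd, sched, strlen(sched));
    close(fd);
    return r < 0 ? -1 : 0;
}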

Stefan
