Date: Thu, 11 Dec 2008 16:53:35 +0100
From: Andrea Arcangeli
To: Gerd Hoffmann
Cc: Anthony Liguori, kvm-devel, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Message-ID: <20081211155335.GE14908@random.random>
In-Reply-To: <494130B5.2080800@redhat.com>

On Thu, Dec 11, 2008 at 04:24:37PM +0100, Gerd Hoffmann wrote:
> Well, linux kernel aio has its share of problems too:
>
>  * Anthony mentioned it may block on certain circumstances (forgot
>    which ones), and you can't figure beforehand to turn off aio then.

We have worse problems as long as bdrv_read/write are used by qcow2.
And we can fix the host kernel in the long run if this becomes an
issue.

>  * It can't handle block allocation.  Kernel handles that by doing
>    such writes synchronously via VFS layer (instead of the separate
>    aio code paths).  Leads to horrible performance and bug reports
>    such as "installs on sparse files are very slow".

I think here you mean O_DIRECT, regardless of aio vs. sync API. I
doubt aio has any relevance to block allocation at all, so whatever
problem we have with the kernel aio API and O_DIRECT should also be
there with a sync API + userland threads and O_DIRECT.

>  * support for vectored aio isn't that old.  IIRC it was added
>    somewhen around 2.6.20 (newer than current suse/redhat enterprise
>    versions).  Which IMHO means you can't expect it being present
>    unconditionally.

I think this is a false alarm: the whole point of kernel AIO is that
even if O_DIRECT is enabled, all bios are pushed to the disk before
the disk queue is unplugged, which is all we care about to get decent
disk bandwidth with zerocopy dma. Or at least that's the way it's
supposed to work if aio is implemented correctly at the bio level.

So on kernels that don't support IOCB_CMD_READV/WRITEV, we simply have
to pass an array of iocbs through io_submit (i.e. convert the iov into
a vector of iocbs, instead of a single iocb pointing to the iov).
Internally to io_submit a single dma command should be generated and
the same sg list should be built as if we had used READV/WRITEV. In
theory READV/WRITEV should be just a cpu saving feature; it shouldn't
influence disk bandwidth. If it does, it means the bio layer is broken
and needs fixing.

If IOCB_CMD_READV/WRITEV is available, good; if not, we go with
READ/WRITE and more dynamically allocated iocbs. It just needs a
conversion routine from (iovec, file, offset) to an array of iocb
pointers when IOCB_CMD_READV/WRITEV is not available. The iocb array
can be preallocated along with the iovec when we detect that
IOCB_CMD_READV/WRITEV is missing; I have a cache layer that does this,
and I'll just provide an output selectable in iovec or iocb terms,
with iocb selected when the host OS is Linux and IOCB_CMD_READV/WRITEV
is not available.
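To make the fallback concrete, here is a minimal sketch of the
iov-to-iocb conversion described above, written against the libaio
helpers (io_prep_pread/io_prep_preadv/io_submit) and assuming a libaio
new enough to provide io_prep_preadv. The submit_readv() name, the
have_preadv flag and the allocation strategy are purely illustrative,
not the actual qemu/kvm code:

#include <libaio.h>
#include <stdlib.h>
#include <sys/uio.h>

/*
 * Submit one vectored read either as a single preadv-style iocb
 * (have_preadv != 0) or, on older kernels, as one plain read iocb per
 * iovec element, all pushed down in a single io_submit() call so the
 * block layer can still merge them before the queue is unplugged.
 */
static int submit_readv(io_context_t ctx, int fd,
                        const struct iovec *iov, int iovcnt,
                        long long offset, int have_preadv)
{
    int i, nr = have_preadv ? 1 : iovcnt;
    struct iocb *iocbs = calloc(nr, sizeof(*iocbs));
    struct iocb **ptrs = calloc(nr, sizeof(*ptrs));

    if (!iocbs || !ptrs) {
        free(iocbs);
        free(ptrs);
        return -1;
    }

    if (have_preadv) {
        /* one iocb describing the whole iovec */
        io_prep_preadv(&iocbs[0], fd, iov, iovcnt, offset);
    } else {
        /* one iocb per iovec element; they are contiguous on disk */
        for (i = 0; i < iovcnt; i++) {
            io_prep_pread(&iocbs[i], fd, iov[i].iov_base,
                          iov[i].iov_len, offset);
            offset += iov[i].iov_len;
        }
    }

    for (i = 0; i < nr; i++)
        ptrs[i] = &iocbs[i];

    /* completions are reaped elsewhere with io_getevents(); the iocb
     * pointers come back in the events, so they are kept allocated
     * until completion rather than freed here */
    return io_submit(ctx, nr, ptrs);
}

Preallocating the iocb array next to the iovec, as described above,
would avoid the per-request calloc; with O_DIRECT the usual buffer,
length and offset alignment rules still apply to each iocb.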
> Threads will be there anyway for kvm smp.

Yes, I didn't mean those threads ;). I love threads, but I love
threads that are CPU bound and allow us to exploit the whole power of
the system! But for storage, threads are pure overscheduling overhead
as far as I can tell, given we have an async api to use and we already
have to deal with the pain of async programming. So it's worth getting
the full benefit of it (i.e. no thread/overscheduling overhead). If
aio inside the kernel is too complex, then it can use kernel threads
internally; that's still better than user threads.

I mean, if we keep using only threads we should get rid of bdrv_aio*
completely and move the qcow2 code into a separate thread instead of
keeping it running from the io thread. If we stick to threads then
it's worth getting the full benefit of threads (i.e. not having to
deal with the pains of async programming and moving the qcow2
computation to a separate CPU). That is something I tried doing, but I
ended up having to add locks all over qcow2 in order to submit
multiple qcow2 requests in parallel (otherwise the lock would be
global and I couldn't differentiate between a bdrv_read for qcow2
metadata, which must be executed with the qcow2 mutex held, and a
bdrv_aio_readv, which can run lockless from the point of view of the
current qcow2 instance - the qcow2 parent may then take its own locks,
etc.). Basically it breaks all backends, something I'm not comfortable
with right now just to get zerocopy dma working at platter speed.
Hence I stick with async programming for now...

> Well, wait for glibc isn't going to fly.  glibc waits for posix, and
> posix waits for a reference implementation (which will not be glibc).

Agreed.

> > and kernel with preadv/pwritev
>
> With that in place you don't need kernel aio any more, then you can
> really do it in userspace with threads.  But that probably would be
> linux-only ^W^W^W

Waiting for preadv/pwritev is just the 'quicker' version of waiting
for glibc aio_readv. And because it remains linux-only, I prefer
kernel AIO, which fixes cfq and should be the most optimal anyway
(with or without READV/WRITEV support).

So in the end: we either open the file 64 times (which I think is
perfectly coherent with nfs unless the nfs client is broken, but then
Anthony may know nfs better; I'm not a heavy nfs user here), or we go
with kernel AIO... you know my preference. That said, opening the file
64 times is probably simpler, if it has been confirmed that it doesn't
break nfs. Breaking nfs is not an option here; nfs is the ideal shared
storage for migration (we surely want to exploit the fact that the
semantics we need for a safe migration are so weak that it's worth
keeping nfs supported as 100% reliable shared storage for KVM
virtualization).

> > ahem: http://www.daemon-systems.org/man/preadv.2.html
>
> Too bad nobody implemented it yet...