From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 18:52:13 +0100
Message-ID: <20081212175213.GP6809@random.random>
In-Reply-To: <49429EA3.8070008@codemonkey.ws>
References: <493FFAB6.2000106@codemonkey.ws> <493FFC8E.9080802@redhat.com> <49400F69.8080707@codemonkey.ws> <20081210190810.GG18814@random.random> <20081212142435.GL6809@random.random> <494276CD.6060904@codemonkey.ws> <20081212154418.GM6809@random.random> <49429629.20309@codemonkey.ws> <20081212170916.GO6809@random.random> <49429EA3.8070008@codemonkey.ws>
To: Anthony Liguori
Cc: Gerd Hoffmann, qemu-devel@nongnu.org, kvm-devel

On Fri, Dec 12, 2008 at 11:25:55AM -0600, Anthony Liguori wrote:
> Hrm, that's more complex than I was expecting. I was thinking the bdrv aio
> infrastructure would always take an iovec. Any details about the
> underlying host's ability to handle the iovec would be insulated.

You can't remove the restart memory-capped mechanism from the dma api: we
have to handle dma to non-ram, which potentially requires copying the whole
buffer, so we're forced to have a safe linearization at the dma api layer.
So it's not necessary to reinvent the same restart-partial-transfer logic
in the aio layer too. Just set the define and teach the aio logic to use
pread/pwrite if iovcnt == 1 and you're done ;).

So what I'm suggesting is simpler than what you were expecting, not more
complex. It would be more complex to replicate the restart-bounce logic in
the aio layer as well. Old drivers using bdrv_aio_read/write will keep
working, and new drivers using the dma api can also use
bdrv_aio_readv/writev; the linearization will happen inside the dma api if
the aio layer lacks preadv/pwritev support.

> If we artificially cap at say 50MB, then you do something like:
>
> while (buffer == NULL) {
>     buffer = try_to_bounce(offset, iov, iovcnt, &size);
>     if (buffer == NULL && errno == ENOMEM) {
>         pthread_wait_cond(more memory);
>     }
> }

What I meant is that you'll never get ENOMEM; the task will be instantly
killed during the memcpy... To get any meaningful behavior from the loop
above you'd need to set overcommit = 1, otherwise two qemu instances
allocating 50M at the same time and then doing the memcpy at the same time
is enough to get one of the two killed with -9.

> This lets us expose a preadv/pwritev function that actually works. The
> expectation is that bouncing will outperform just doing pread/pwrite of
> each vector. Of course, you could get smart and if try_to_bounce fail,
> fall back to pread/pwrite each vector. Likewise, you can fast-path the
> case of a single iovec to avoid bouncing entirely.

Yes, pread/pwrite of each vector can't perform if O_DIRECT is enabled. If
O_DIRECT is disabled it could reach remotely reasonable levels depending on
the host-exception (syscall) cost vs the memcpy cost, but we'd rather
bounce to be sure: testing the dma api with a bounce buffer of 512 bytes
(maximizing the number of syscalls because of the flood of restarts) slowed
the I/O to a crawl even with buffering enabled. The syscall overhead is
clearly very significant; basically memcpy is faster than a syscall for
512 bytes here. But just let the dma api do the iovec thing.
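Concretely, something along these lines in the aio worker is all I have in
mind (only a sketch, and CONFIG_PREADV / aio_worker_rw are made-up names,
not existing qemu code):

    #include <sys/uio.h>
    #include <unistd.h>
    #include <assert.h>

    static ssize_t aio_worker_rw(int fd, const struct iovec *iov, int iovcnt,
                                 off_t offset, int is_write)
    {
    #ifdef CONFIG_PREADV
        /* the host can take the iovec directly */
        return is_write ? pwritev(fd, iov, iovcnt, offset)
                        : preadv(fd, iov, iovcnt, offset);
    #else
        /* no preadv/pwritev: the dma api already linearized, so the only
         * thing we should ever see here is a single element */
        assert(iovcnt == 1);
        return is_write ? pwrite(fd, iov[0].iov_base, iov[0].iov_len, offset)
                        : pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
    #endif
    }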
If you want to provide an abstraction that also works if the dma api
decides to send down an iovcnt > 1, you could simply implement the
fallback, but I don't think it's worth it: it should never happen that you
get an iovcnt > 1 when preadv/pwritev aren't available. You'd be writing
code whose only effect could be to hide a performance bug -> not worth it.
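For completeness, the bounce path I keep referring to lives on the dma api
side and could look roughly like this (again just a sketch with invented
names and an arbitrary cap, not the actual dma api code):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/uio.h>

    #define DMA_BOUNCE_CAP (1 << 20)   /* arbitrary cap for the sketch */

    /* Linearize the iovec through a size-capped bounce buffer and keep
     * restarting until the whole transfer is done, so the aio layer below
     * only ever sees one linear buffer at a time (iovcnt == 1). */
    static ssize_t dma_bounce_pwritev(int fd, const struct iovec *iov,
                                      int iovcnt, off_t offset)
    {
        char *bounce = malloc(DMA_BOUNCE_CAP);
        ssize_t done = 0;
        size_t iov_off = 0;
        int i = 0;

        if (!bounce)
            return -1;

        while (i < iovcnt) {
            size_t fill = 0;
            ssize_t ret;

            /* copy up to DMA_BOUNCE_CAP bytes out of the iovec */
            while (i < iovcnt && fill < DMA_BOUNCE_CAP) {
                size_t chunk = iov[i].iov_len - iov_off;
                if (chunk > DMA_BOUNCE_CAP - fill)
                    chunk = DMA_BOUNCE_CAP - fill;
                memcpy(bounce + fill,
                       (char *)iov[i].iov_base + iov_off, chunk);
                fill += chunk;
                iov_off += chunk;
                if (iov_off == iov[i].iov_len) {
                    i++;
                    iov_off = 0;
                }
            }

            /* one plain pwrite per bounce-buffer-full: this is where the
             * restart-partial-transfer logic belongs */
            ret = pwrite(fd, bounce, fill, offset + done);
            if (ret != (ssize_t)fill) {
                free(bounce);
                return -1;
            }
            done += ret;
        }

        free(bounce);
        return done;
    }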