From: Andrea Arcangeli
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 18:52:13 +0100
Message-ID: <20081212175213.GP6809@random.random>
In-Reply-To: <49429EA3.8070008@codemonkey.ws>
References: <493FFAB6.2000106@codemonkey.ws> <493FFC8E.9080802@redhat.com> <49400F69.8080707@codemonkey.ws> <20081210190810.GG18814@random.random> <20081212142435.GL6809@random.random> <494276CD.6060904@codemonkey.ws> <20081212154418.GM6809@random.random> <49429629.20309@codemonkey.ws> <20081212170916.GO6809@random.random> <49429EA3.8070008@codemonkey.ws>
To: Anthony Liguori
Cc: Gerd Hoffmann, qemu-devel@nongnu.org, kvm-devel

On Fri, Dec 12, 2008 at 11:25:55AM -0600, Anthony Liguori wrote:
> Hrm, that's more complex than I was expecting. I was thinking the bdrv aio
> infrastructure would always take an iovec. Any details about the
> underlying host's ability to handle the iovec would be insulated.

You can't remove the restart memory-capped mechanism from the dma api: we
have to handle dma to non-ram, which potentially requires copying the whole
buffer, so we're forced to have a safe linearization at the dma api layer.
So it's not necessary to reinvent the same restart-partial-transfer logic
in the aio layer too. Just set the define and teach the aio logic to use
pread/pwrite if iovcnt == 1 and you're done ;).

So what I'm suggesting is simpler than what you were expecting, not more
complex. It would be more complex to replicate the restart-bounce logic in
the aio layer as well. Old drivers using bdrv_aio_read/write will keep
working, and new drivers using the dma api can also use
bdrv_aio_readv/writev; the linearization will happen inside the dma api if
the aio layer lacks preadv/pwritev support.

> If we artificially cap at say 50MB, then you do something like:
>
> while (buffer == NULL) {
>     buffer = try_to_bounce(offset, iov, iovcnt, &size);
>     if (buffer == NULL && errno == ENOMEM) {
>         pthread_wait_cond(more memory);
>     }
> }

What I meant is that you'll never get ENOMEM; the task will be instantly
killed during the memcpy... To get any meaningful behavior from the loop
above you'd need to set overcommit = 1, otherwise two qemu instances
allocating 50M at the same time and then doing the memcpy at the same time
is enough to get one of the two killed with -9.

> This lets us expose a preadv/pwritev function that actually works. The
> expectation is that bouncing will outperform just doing pread/pwrite of
> each vector. Of course, you could get smart and if try_to_bounce fail,
> fall back to pread/pwrite each vector. Likewise, you can fast-path the
> case of a single iovec to avoid bouncing entirely.

Yes, pread/pwrite of each vector can't perform if O_DIRECT is enabled. If
O_DIRECT is disabled it could reach remotely reasonable levels depending on
the host-exception (syscall) cost vs the memcpy cost, but we'd rather
bounce to be sure: testing the dma api with a bounce buffer of 512 bytes
(maximizing the number of syscalls because of the flood of restarts) slowed
the I/O to a crawl even with buffering enabled. The syscall overhead is
clearly very significant; basically memcpy is faster than a syscall for
512 bytes here. But just let the dma api do the iovec thing.
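Concretely, something along these lines in the aio worker is all I have in
mind (only a sketch, and CONFIG_PREADV / aio_worker_rw are made-up names,
not existing qemu code):

    #include <sys/uio.h>
    #include <unistd.h>
    #include <assert.h>

    static ssize_t aio_worker_rw(int fd, const struct iovec *iov, int iovcnt,
                                 off_t offset, int is_write)
    {
    #ifdef CONFIG_PREADV
        /* the host can take the iovec directly */
        return is_write ? pwritev(fd, iov, iovcnt, offset)
                        : preadv(fd, iov, iovcnt, offset);
    #else
        /* no preadv/pwritev: the dma api already linearized, so the only
         * thing we should ever see here is a single element */
        assert(iovcnt == 1);
        return is_write ? pwrite(fd, iov[0].iov_base, iov[0].iov_len, offset)
                        : pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
    #endif
    }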
If you want to provide an abstraction that also works if the dma api
decides to send down an iovcnt > 1, you could simply implement the
fallback, but I don't think it's worth it: it should never happen that you
get an iovcnt > 1 when preadv/pwritev aren't available. You'd be writing
code whose only effect could be to hide a performance bug -> not worth it.
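For completeness, the bounce path I keep referring to lives on the dma api
side and could look roughly like this (again just a sketch with invented
names and an arbitrary cap, not the actual dma api code):

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/uio.h>

    #define DMA_BOUNCE_CAP (1 << 20)   /* arbitrary cap for the sketch */

    /* Linearize the iovec through a size-capped bounce buffer and keep
     * restarting until the whole transfer is done, so the aio layer below
     * only ever sees one linear buffer at a time (iovcnt == 1). */
    static ssize_t dma_bounce_pwritev(int fd, const struct iovec *iov,
                                      int iovcnt, off_t offset)
    {
        char *bounce = malloc(DMA_BOUNCE_CAP);
        ssize_t done = 0;
        size_t iov_off = 0;
        int i = 0;

        if (!bounce)
            return -1;

        while (i < iovcnt) {
            size_t fill = 0;
            ssize_t ret;

            /* copy up to DMA_BOUNCE_CAP bytes out of the iovec */
            while (i < iovcnt && fill < DMA_BOUNCE_CAP) {
                size_t chunk = iov[i].iov_len - iov_off;
                if (chunk > DMA_BOUNCE_CAP - fill)
                    chunk = DMA_BOUNCE_CAP - fill;
                memcpy(bounce + fill,
                       (char *)iov[i].iov_base + iov_off, chunk);
                fill += chunk;
                iov_off += chunk;
                if (iov_off == iov[i].iov_len) {
                    i++;
                    iov_off = 0;
                }
            }

            /* one plain pwrite per bounce-buffer-full: this is where the
             * restart-partial-transfer logic belongs */
            ret = pwrite(fd, bounce, fill, offset + done);
            if (ret != (ssize_t)fill) {
                free(bounce);
                return -1;
            }
            done += ret;
        }

        free(bounce);
        return done;
    }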