From: Anthony Liguori
Subject: Re: [Qemu-devel] [RFC] Replace posix-aio with custom thread pool
Date: Fri, 12 Dec 2008 11:25:55 -0600
Message-ID: <49429EA3.8070008@codemonkey.ws>
In-Reply-To: <20081212170916.GO6809@random.random>
To: Andrea Arcangeli
Cc: Gerd Hoffmann, qemu-devel@nongnu.org, kvm-devel

Andrea Arcangeli wrote:
> On Fri, Dec 12, 2008 at 10:49:45AM -0600, Anthony Liguori wrote:
>
>> I meant, if you wanted to pass a file descriptor as a raw device.  So:
>>
>>   qemu -hda raw:fd=4
>>
>> Or something like that.  We don't support this today.
>
> ah ok.
>
>> I think bouncing the iov and just using pread/pwrite may be our best
>> bet.  It means memory allocation, but we can cap it.  Since we're
>> using threads,
>
> It's already capped.  However, currently it generates an iovec, but we
> simply have to check that iovcnt is 1; if it is, we pread from
> iov.iov_base for iov.iov_len bytes.  The DMA API will take care of
> enforcing iovcnt == 1 for the iovec if preadv/pwritev isn't detected
> at compile time.

Hrm, that's more complex than I was expecting.  I was thinking the bdrv
AIO infrastructure would always take an iovec, so any details about the
underlying host's ability to handle the iovec would be insulated from
callers.

>> we can just force a thread to sleep until memory becomes available,
>> so it's actually pretty straightforward.
>
> There's no way to detect that and wait for memory,

If we artificially cap at, say, 50MB, then you do something like:

    while (buffer == NULL) {
        buffer = try_to_bounce(offset, iov, iovcnt, &size);
        if (buffer == NULL && errno == ENOMEM) {
            pthread_cond_wait(&more_memory, &lock);
        }
    }

try_to_bounce() allocates with malloc(), but if the allocation would
exceed 50MB, it fails with ENOMEM.  In your bounce_free() function, you
do a pthread_cond_broadcast() to wake up any threads potentially
waiting to allocate memory.

This lets us expose a preadv/pwritev function that actually works.  The
expectation is that bouncing will outperform just doing a pread/pwrite
of each vector.

Of course, you could get smart: if try_to_bounce() fails, fall back to
pread/pwrite on each vector.  Likewise, you can fast-path the case of a
single iovec to avoid bouncing entirely.  (Both ideas are sketched in
code below.)

Regards,

Anthony Liguori

> it'd SIGKILL before you can check... at least with the default
> overcommit.  The way the DMA API works is that it doesn't send one
> mega-large writev, but sends it in pieces capped by the max buffer
> size, with many iovecs with iovcnt = 1.
>
>> We can use libaio on older Linux kernels to simulate preadv/pwritev,
>> use the proper syscalls on newer kernels and on BSDs, and bounce
>> everything else.
>>
>
> Given that READV/WRITEV aren't available except on fairly recent
> kernels, and given that without O_DIRECT each iocb becomes
> synchronous, we can't use libaio.  Also, once they fix linux-aio, if
> we went that way, the iocb logic would need to be largely refactored.
> So I'm not sure it's worth it, as it can't handle 2.6.16-18 when
> O_DIRECT is disabled (when O_DIRECT is enabled we could just build an
> array of linear iocbs).
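To make the capped bounce-buffer scheme above concrete, here is a
minimal sketch in C.  try_to_bounce(), bounce_free(), the waiting loop,
and the 50MB cap come from the email; everything else (the lock and
counter names, the simplified signature without the offset argument) is
an assumption for illustration, not actual QEMU code:

    /*
     * Sketch only: a bounce-buffer pool capped at 50MB.  Allocations
     * that would exceed the cap fail with ENOMEM; callers sleep on a
     * condition variable until bounce_free() broadcasts.
     */
    #include <errno.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <sys/uio.h>

    #define BOUNCE_LIMIT (50 << 20)            /* the email's 50MB cap */

    static pthread_mutex_t bounce_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  more_memory = PTHREAD_COND_INITIALIZER;
    static size_t bounce_in_use;

    /*
     * Try to allocate one linear buffer covering the whole iovec.
     * Fails with ENOMEM if the cap would be exceeded.  Call with
     * bounce_lock held.
     */
    static void *try_to_bounce(const struct iovec *iov, int iovcnt,
                               size_t *size)
    {
        size_t total = 0;
        void *buf;
        int i;

        for (i = 0; i < iovcnt; i++) {
            total += iov[i].iov_len;
        }
        if (bounce_in_use + total > BOUNCE_LIMIT) {
            errno = ENOMEM;
            return NULL;
        }
        buf = malloc(total);
        if (buf == NULL) {
            return NULL;                   /* malloc sets errno = ENOMEM */
        }
        bounce_in_use += total;
        *size = total;
        return buf;
    }

    static void bounce_free(void *buf, size_t size)
    {
        free(buf);
        pthread_mutex_lock(&bounce_lock);
        bounce_in_use -= size;
        pthread_cond_broadcast(&more_memory);  /* wake waiting threads */
        pthread_mutex_unlock(&bounce_lock);
    }

    /* The waiting loop from the email, wrapped up for callers. */
    static void *bounce_iov(const struct iovec *iov, int iovcnt,
                            size_t *size)
    {
        void *buffer = NULL;

        pthread_mutex_lock(&bounce_lock);
        while (buffer == NULL) {
            buffer = try_to_bounce(iov, iovcnt, size);
            if (buffer == NULL && errno == ENOMEM) {
                pthread_cond_wait(&more_memory, &bounce_lock);
            }
        }
        pthread_mutex_unlock(&bounce_lock);
        return buffer;
    }

A real implementation would also want to drop the lock around the
malloc() call and reject outright any single request larger than the
cap, since such a request would otherwise wait forever.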
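Building on that, here is a sketch of the emulated preadv() the email
describes: fast-path a single iovec, bounce into one linear buffer when
the pool has room, and fall back to one pread() per vector otherwise.
qemu_preadv is a hypothetical name for illustration; it reuses the
try_to_bounce()/bounce_free() sketch above:

    /*
     * Sketch only: preadv() emulation per the scheme in the email.
     * Not a real QEMU API.
     */
    #include <string.h>
    #include <unistd.h>

    static ssize_t qemu_preadv(int fd, const struct iovec *iov,
                               int iovcnt, off_t offset)
    {
        void *buf;
        size_t size, off, len;
        ssize_t ret, total;
        int i;

        /* Fast path: a single element needs no bouncing at all. */
        if (iovcnt == 1) {
            return pread(fd, iov[0].iov_base, iov[0].iov_len, offset);
        }

        /* Preferred path: one big pread() into a bounce buffer, then
         * scatter the data out to the vectors. */
        pthread_mutex_lock(&bounce_lock);
        buf = try_to_bounce(iov, iovcnt, &size);
        pthread_mutex_unlock(&bounce_lock);
        if (buf != NULL) {
            ret = pread(fd, buf, size, offset);
            for (i = 0, off = 0;
                 ret > 0 && i < iovcnt && off < (size_t)ret; i++) {
                len = iov[i].iov_len;
                if (len > (size_t)ret - off) {
                    len = (size_t)ret - off;
                }
                memcpy(iov[i].iov_base, (char *)buf + off, len);
                off += len;
            }
            bounce_free(buf, size);
            return ret;
        }

        /* Fallback: pread() each vector separately. */
        total = 0;
        for (i = 0; i < iovcnt; i++) {
            ret = pread(fd, iov[i].iov_base, iov[i].iov_len,
                        offset + total);
            if (ret < 0) {
                return total ? total : ret;
            }
            total += ret;
            if ((size_t)ret < iov[i].iov_len) {
                break;                     /* short read: stop early */
            }
        }
        return total;
    }

The write side would be symmetric: gather the vectors into the bounce
buffer with memcpy() first, then issue a single pwrite().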